Gaussian Robust Classification 



A thesis submitted in partial fulfillment of the 
requirements for the degree of Master of Science 



by 

Ido Ginodi 



Supervised by Dr. Amir Globerson 



December 2010 



The School of Computer Science and Engineering 
The Hebrew University of Jerusalem, Israel 



Abstract 

Supervised learning is all about the ability to generalize knowledge. Specifically, the goal
of learning is to train a classifier on training data in such a way that it will be capable
of classifying new, unseen data correctly. In order to achieve this goal, it is important to
design the learner carefully, so that it does not overfit the training data. The latter can be done
in several ways, of which adding a regularization term is probably the most common.
Statistical learning theory explains the success of regularization by
claiming that it restricts the complexity of the learned model. This explanation, however,
is rather abstract and lacks a geometric intuition.

The generalization error of a classifier may be thought of as correlated with its robustness
to perturbations of the data. Namely, if a classifier is capable of coping with
disturbance, it is expected to generalize well. Indeed, it was established that the ordinary
SVM formulation is equivalent to a robust formulation, in which an adversary may displace
the training and testing points within a ball of pre-determined radius (Xu et al. [2009]).



In this work we explore a different kind of robustness. We suggest replacing each data
point with a Gaussian cloud centered at the original point. The loss is evaluated as the
expectation of an underlying loss function on the cloud. This setup fits the fact that in
many applications the data is sampled along with noise. We develop a robust optimization
(RO) framework, in which the adversary chooses the covariance of the noise. In our
algorithm, named GURU, the tuning parameter is the variance of the noise that contaminates
the data, and so it can be estimated using physical or application-specific considerations. Our
experiments show that this framework generates classifiers that perform as well as SVM,
and even slightly better in some cases. Generalizations for Mercer kernels and for the
multiclass case are presented as well. We also show that our framework may be further
generalized, using the technique of convex perspective functions.

Chapter 1
Introduction 



1. Motivation 

The ability to understand new, unseen data, based on knowledge that was gained from a
training sample, is probably the main goal of machine learning. In the supervised learning
setup, one is given a training set consisting of data samples along with labels indicating their
'type' or 'class'. The learning task in this case is to develop a decision rule which will
allow predicting the correct label of unfamiliar data.

As the main goal is to be able to generalize, it makes sense to design the learning
process so that it reflects the conditions under which the classifier is going to be tested and
used. In many real world applications, the data we are given is corrupted by noise. The
noise may be either inherent to the process that generates the data or adversarial. Examples
of inherent noise include a noisy sensor and natural variability of the data. Adversarial
noise is present, for example, in spam emails. Either way, it is vital to learn how to classify
when noise is present. We suggest doing so by preparing for the worst case. Among all noise
distributions that have bounded power (i.e. bounded covariance), Gaussian noise is
believed to be the most problematic, since it has the maximal entropy.

By designing a classifier that is robust to Gaussian noise, we are able to learn and generalize
well, without the need to introduce an explicit regularization term. In that respect, our
work aims at shedding more light on the connection between robustness and generalization.

2. The supervised learning framework 

Formally speaking, the supervised learning setup consists of three major components: 

1. Data. We denote by $\mathcal{X}$ the sample space, in which the data samples live (i.e. the objects
one tries to classify, e.g., vector representations of handwritten digits). Alongside the
sample space, we are given the label set, denoted $\mathcal{Y}$. This set contains the various
classes to which the data points may be assigned (e.g., 0, 1, . . . , 9 in the handwritten
digits example). A distribution $\mathcal{D}$ is defined over $\mathcal{X} \times \mathcal{Y}$, and dictates the probability
of sampling a data point $x \in \mathcal{X}$ along with a label $y \in \mathcal{Y}$. In our discussion we will
restrict ourselves to the Euclidean case, namely $\mathcal{X} = \mathbb{R}^d$. Unless stated otherwise,
we assume a binary setting, in which $\mathcal{Y} = \{+1, -1\}$.

2. Hypothesis class. In the learning process, one considers candidate hypotheses taken
from the class $\mathcal{H}$. This class consists of functions from $\mathcal{X}$ to $\mathcal{Y}$. Its contents reflect
some kind of prior knowledge about the problem at hand. A well known example is the
class of half-spaces, defined as

$$\mathcal{H}_{\text{half-space}} = \{h_w(x) = \mathrm{sgn}(w^T x) : \mathbb{R}^d \to \{+1, -1\}, \; w \in \mathbb{R}^d\} \tag{1.1}$$

3. Loss measure. The means to measure the performance of a specific instance $h \in \mathcal{H}$
is the loss function, $\ell : \mathcal{X} \times \mathcal{Y} \times \mathcal{H} \to \mathbb{R}_+$. The most intuitive loss function in the
binary case is the zero-one loss, defined by

$$\ell_{0\text{-}1}(x, y; h) = \mathbb{1}[h(x) \neq y] \tag{1.2}$$

The learning task is to find the classifier $h^* \in \mathcal{H}$ which is optimal, in the sense that it
minimizes the actual risk, defined as

$$\mathrm{err}(h) = E_{(x,y) \sim \mathcal{D}} \, \ell(x, y; h) \tag{1.3}$$

Most of the time, however, $\mathcal{D}$ is unknown. Even in the rare cases in
which it is known, it is not always possible to optimize the expectation over it. The learner
is thus given a training set $S \subset \mathcal{X} \times \mathcal{Y}$ of i.i.d. samples. The learning task in that case is
to minimize the empirical risk, defined as

$$\mathrm{err}_S(h) = \frac{1}{M} \sum_{m=1}^{M} \ell(x^m, y^m; h) \tag{1.4}$$

where $S = \{(x^m, y^m)\}_{m=1}^{M}$. This technique is called empirical risk minimization (ERM).
It is important to keep in mind that although the technical tool is ERM, the objective is
always to have the actual risk as low as possible.
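As a minimal illustration of the empirical risk, the sketch below averages a loss over a training set. It assumes the zero-one loss and a toy one-dimensional threshold classifier; the names `zero_one_loss`, `empirical_risk` and `threshold_classifier`, as well as the data, are hypothetical and serve only as an example.

```python
def zero_one_loss(x, y, h):
    """Zero-one loss: 1 if the hypothesis h mislabels x, 0 otherwise."""
    return 0.0 if h(x) == y else 1.0


def empirical_risk(samples, h, loss=zero_one_loss):
    """Average loss of h over a training set given as a list of (x, y) pairs."""
    return sum(loss(x, y, h) for x, y in samples) / len(samples)


def threshold_classifier(x):
    """Hypothetical 1-D classifier: predicts +1 for non-negative inputs."""
    return +1 if x >= 0 else -1


# Toy training set (illustration only); the last sample is mislabeled by the classifier.
training_set = [(-2.0, -1), (-0.5, -1), (0.3, +1), (1.7, +1), (-0.1, +1)]

print(empirical_risk(training_set, threshold_classifier))
```

One of the five toy samples is misclassified, so the empirical risk under the zero-one loss is one fifth.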

Sometimes, however, a low empirical risk does not translate into a low actual risk. That is, in spite of the fact that the learned
decision rule is capable of classifying the training data, it fails to do so on fresh test data. In
this case we say that the generalization error is high, although the training error is low. The
reason for such a failure is most often overfitting. In this situation, the learned classifier fits
the training data very well, but misses the general rule behind the data. In the PAC model,
overfitting is explained by a hypothesis class that is too rich. If the learner can choose a model
that fits the training data perfectly, it will do so, ignoring the fact that the chosen model
will possibly not be able to explain new data. Say, for example, that the hypothesis class
consists of all the functions from the sample space to the label space. A naive learner
might choose a classifier that handles all the training points well, whereas any unknown
sample is classified as +1. Such a selection is obviously prone to errors on new data. In the
spirit of this idea, the PAC theory bounds the difference between the empirical and the
actual risk using a combinatorial measure of the hypothesis class complexity, named the VC
dimension. For a detailed review see Vapnik [1995].



A common solution for this problem is to add a regularization term to the objective
of the minimization problem. Usually, a norm of the classifier is taken as the regularization
term. From the point of view of statistical learning theory, the regularization restricts the
complexity of the model, and by that controls the difference between the training and testing
error (Smola et al. [1998]; Evgeniou et al. [2000]; Bartlett et al. [2002]). The idea of
minimizing the complexity of the model is not unique to the statistical theory, and may
be traced back to Occam's razor principle: the simplest hypothesis that explains the
phenomenon is likely to be the correct one. Another way to understand the regularization
term is as a means to introduce prior knowledge.

3. Support Vector Machines 

In the support vector machine (SVM), the loss measure at hand is the hinge loss

$$\ell_{\mathrm{hinge}}(x^m, y^m; w) = [1 - y^m w^T x^m]_+$$

where $[z]_+ = \max(z, 0)$. $\ell_{\mathrm{hinge}}$ is a surrogate loss function, in the sense that it upper-bounds the zero-one loss.
Furthermore, $\ell_{\mathrm{hinge}}$ is convex, which makes it a far more convenient objective for numerical
optimization than the zero-one loss. Note that the hinge loss introduces a penalty even when the
classifier correctly predicts the label of a sample, but does so with too little margin, i.e.
$y^m w^T x^m < 1$. The penalty on a wrong classification is linear in the distance of the sample
from the hyperplane.
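The relation between the two losses can be seen on a few toy samples. The following sketch evaluates the zero-one loss and the hinge loss of a linear classifier without a bias term; the function names, the weight vector and the samples are assumptions chosen for illustration, and a zero score is arbitrarily labeled +1.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def zero_one_loss(x, y, w):
    """1 if sign(w^T x) differs from the label y, 0 otherwise."""
    predicted = 1 if dot(w, x) >= 0 else -1  # assumption: a zero score is labeled +1
    return 0.0 if predicted == y else 1.0


def hinge_loss(x, y, w):
    """Hinge loss [1 - y w^T x]_+ ."""
    return max(0.0, 1.0 - y * dot(w, x))


w = [1.0, -0.5]
samples = [
    ([2.0, 0.0], +1),  # correct with a large margin: no penalty
    ([0.5, 0.0], +1),  # correct, but with too little margin: small penalty
    ([1.0, 0.0], -1),  # misclassified: penalty grows linearly
]
for x, y in samples:
    print(zero_one_loss(x, y, w), hinge_loss(x, y, w))
```

On each sample the hinge loss is at least as large as the zero-one loss, which is the surrogate property mentioned in the text.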

As discussed, optimizing the sum of the losses alone may result in poor generalization
performance. The SVM solution is to add an $L_2$ regularization term. The geometrical
intuition behind this term is the following: the distance between the point $x^m$ and the
hyperplane $w^T x = b$ is given by

$$\frac{|w^T x^m - b|}{\|w\|} \tag{1.5}$$

One may scale $w$ and $b$ in such a way that the point with the smallest margin (that is, the
one closest to the hyperplane) will have $|w^T x^m - b| = 1$. In that case, the bilateral margin
is $\frac{2}{\|w\|}$ (see Figure 1.2). This geometrical intuition, along with the fact that the hinge loss
punishes too little margin, motivates the name Maximum Margin Classification that was
given to SVM. Hence, the SVM optimization task is

$$\min_{w, b} \; \frac{\lambda}{2}\|w\|^2 + \sum_{m=1}^{M} [1 - y^m (w^T x^m - b)]_+ \tag{1.6}$$

The parameter $\lambda$ controls the tradeoff between the training error and the margin of the
classifier (cf. Section 5).

Figure 1.1: The hinge loss is a convex surrogate to the zero-one loss. (Plotted curves: zero-one loss, hinge loss.)

Figure 1.2: The bilateral margin is $\frac{2}{\|w\|}$. Thus, minimizing $\|w\|$ results in maximizing the margin.



4. Robustness 

The objective of learning is to be able to classify new data. Thus, being robust to
perturbations of the data is usually a desirable property for a classifier. In some cases, the
training data and the testing data are sampled from different processes, which are similar
to some extent but are not identical (Bi and Zhang [2004]). This situation can also arise
from application-specific issues, when new samples are acquired with reduced accuracy
(for example, the training data may be collected with an expensive sensor, whereas cheaper
sensors are deployed for actual use).

An even harder scenario is that of learning in the presence of an adversary that may
corrupt the training data, the testing data or both. The key step in formulating the
robust learning task is to model the action of the adversary, i.e., to define the family
of perturbations that he may apply to the data points. In the Robust-SVM model, the
adversary may apply a bounded additive disturbance, by displacing a sample point within
a ball around it (Shivaswamy et al. [2006]). This case is referred to as box-uncertainty.
Globerson and Roweis [2006] assumed a different type of adversary. In their model, named
FDROP, the adversary is allowed to delete a bounded number of features. This model
results in more balanced classifiers, which are less likely to base their prediction only on a
small subset of informative features.

Two issues usually recur in robust learning formulations. The first one is the problem
of the adversarial choice. Most of the time, the first step in the analysis of the model is
characterizing the exact action of the adversary on a specific data sample, given specific
model parameters. The Robust-SVM adversary will choose to displace the point perpendicularly
to the separating hyperplane. FDROP's adversary will delete the most informative
features, i.e. those that have the maximal contribution to the dot product between the
weight vector and the data point. The second issue is the restriction on the adversary's
action. Regardless of the actual type of perturbation that the adversary uses, one needs to
bound the extent to which it is applied. If no constraint is specified, the adversary will
choose his action in such a way that the signal to noise ratio (SNR) will vanish, and the
data will no longer carry any information. In the Robust-SVM formulation, the adversary
is constrained to perform a displacement within a bounded ball. In the case of FDROP, no
more than a pre-defined number of features may be deleted.
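For the ball-constrained adversary under the Euclidean norm, the adversarial choice has a simple closed form: the worst displacement of radius r moves the point along w against its label, which raises the argument of the hinge by r times the norm of w. The sketch below compares this with random displacements of the same radius. It is only an illustration: the function names and toy numbers are assumptions, the bias term is omitted, and the perturbation is taken to be additive.

```python
import math
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def hinge(x, y, w):
    return max(0.0, 1.0 - y * dot(w, x))


def worst_case_displacement(y, w, radius):
    """Displacement of Euclidean norm `radius` pointing along -y * w."""
    norm_w = math.sqrt(dot(w, w))
    return [-y * radius * wi / norm_w for wi in w]


def random_displacement(dim, radius, rng):
    """A random direction scaled to Euclidean norm `radius`."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm_g = math.sqrt(dot(g, g))
    return [radius * gi / norm_g for gi in g]


w, x, y, radius = [3.0, 4.0], [0.2, 0.1], +1, 0.2

delta_star = worst_case_displacement(y, w, radius)
worst = hinge([xi + di for xi, di in zip(x, delta_star)], y, w)
closed_form = max(0.0, 1.0 - y * dot(w, x) + radius * math.sqrt(dot(w, w)))

rng = random.Random(0)
sampled = max(
    hinge([xi + di for xi, di in zip(x, random_displacement(2, radius, rng))], y, w)
    for _ in range(1000)
)
print(worst, closed_form, sampled)
```

The loss at the perpendicular displacement matches the closed form, and none of the random displacements of the same radius exceeds it.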

Note that robust formulations are closely related to the notion of consistency. A classifier
is said to be consistent if data points that are close enough are predicted to have the same label.
Different adversarial models befit different notions of distance. For example, the
box-uncertainty model is related to the Euclidean metric, and feature deletion suits the Hamming
distance.

It should be mentioned that robustness has quite a few meanings in the literature of 
statistics and machine learning. In this work, we use robustness in the sense of robust 
optimization (RO), i.e. minimizing the worst-case loss under given circumstances. 



5. Between robustness and regularization 



The fact that robustness is related to regularization and generalization is not too surprising.
Indeed, the first equivalence results were established for learning problems other than
classification more than a decade ago (Ghaoui and Lebret [1997]; Xu et al. [2008]; Bishop
[1994]). Recently, Xu et al. [2009] have proven that the regularization employed by
SVM is equivalent to a robust formulation. Specifically, they have shown that the following
two formulations are equivalent

$$\min_{w, b} \; \lambda\|w\| + \sum_{m=1}^{M} [1 - y^m (w^T x^m - b)]_+$$

$$\min_{w, b} \; \max_{\sum_m \|\delta_m\|_* \le \lambda} \; \sum_{m=1}^{M} [1 - y^m (w^T (x^m - \delta_m) - b)]_+$$

where $\|\cdot\|_*$ is the dual norm. This equivalence has a strong geometric interpretation, and
sheds new light on the function of the tuning parameter of SVM. Using the notion of
robustness, a consistency result for SVM was given, without the use of VC or stability
arguments. The novelty of that work stems from the fact that most previous works on
robust classification were not aimed at relating robustness to regularization. Rather, the
models were based on an already regularized SVM formulation, in which the loss measure
was effectively modified.



6. Our contribution 

In this work we adopt the idea of using robustness as a means to achieve generalization. We
present a new robust-learning setup in which each data point is replaced by a stochastic cloud
centered on it. The loss is then evaluated as the expectation of an underlying loss on the
cloud. The parameters of this cloud's distribution are chosen in an adversarial fashion. We
analyze the case in which the adversary is restricted to choose a Gaussian cloud with a
trace-bounded covariance matrix. We then show that this formulation culminates in a smooth
upper-approximation of the hinge loss, which gets tighter as the cloud around each data
sample shrinks. This loss function can be shown to have a convex perspective structure. By
deriving the dual problem, we are able to demonstrate a method of generating new smooth
loss functions. Our algorithmic approach is to solve the primal problem directly. We show
that this yields a learning algorithm which generalizes as well as SVM on synthetic as well
as real data. Generalizations to the non-linear and multiclass cases are given.



7. Related work 



Other works have incorporated a noise model into the learning setup. For example,
Pothin and Richard [2008] have wrapped the data points with ellipsoids. Pozdnoukhov et al.
[2005] have shown how to train classifiers for distributions. Similar to what we do in this
work, they use tails of distributions in their derivation. Their work, however, treated each
data class as a distribution, whereas in this work we attach a noise distribution to each data
point separately. Bhattacharyya et al. [2004b]; Shivaswamy et al. [2006] have employed
second order cone programming (SOCP) methods in order to handle the uncertainty in
the data. Bhattacharyya et al. [2004a] have assumed stochastic clouds instead of discrete
points, as we do, but they did not try to minimize the expectation of the loss function over
the cloud. Instead, they incorporated this assumption into the soft margin framework. Bi
and Zhang [2004] have tried to learn a better classifier by presenting the learning algorithm
with 'more reasonable' samples. We elaborate on this model in Appendix A.

Smooth loss functions were studied by Zhang et al. [2003]; Chapelle [2007]. An analysis
of methods for solving SVM and SVM-like problems using the primal formulation was
done by Shalev-Shwartz et al. [2007a]; Chapelle [2007].



The rest of this document is organized as follows: in Chapter 2 we present our framework
formally, derive the explicit form of the smooth loss function and devise an algorithm that
finds the optimal classifier. In Chapter 3 we derive a dual formulation for the problem, and
point out that our model may be generalized for other loss functions. In Chapter 4 we apply
the kernel trick and devise a method for training non-linear classifiers at the same cost as
for the linear kernel. Chapter 5 contains a generalization of the binary algorithm for the
multiclass case. Finally, in Chapter 6 we discuss the contributions presented in this work
and mention possible directions for future work. In Appendix A we discuss a far more
basic version of resistance to noise. The results of the first section therein are not original
and are presented here only for the sake of logical order. The next section contains a simple
generalization for the multiclass case. Appendix B gives the solution to the adversarial
choice problem for an adversary that is restricted to spread the noise along the primary
axes. Finally, in Appendix C we explain why we find the usual multiclass hinge loss
inapplicable in our framework.

Chapter 2

Gaussian Robust Classification 



In this work we take the approach of robust optimization (RO). Accordingly, we present a
min-max learning framework, in which the learner strives to minimize the loss, whereas the
adversary tries to maximize it. The model that we introduce in this chapter has two layers
of 'robustness'. Firstly, we use the min-max robustness, which lies at the foundation of
RO. Secondly, we effectively enhance the training dataset by taking into consideration all
the possible outputs of the adversarial perturbation. More concretely, we replace each training
sample with a stochastic cloud. The shape of this cloud is chosen by the adversary from a
pre-determined family of distributions. The spreading of the samples should be understood
as adding noise, where different disturbances take place with different probabilities. The
loss on each sample is finally computed as the expectation of an underlying loss on the
respective cloud.

1. Problem Formulation 

In this section we formally describe the model we investigate in this work. We take the
hinge loss as the underlying loss function, and build the learning framework on top of
it. We then show that the new framework we introduce is equivalent to an unconstrained
minimization of an effective loss function.
Recall that the hinge loss is defined as

$$\ell_{\mathrm{hinge}}(x^m, y^m; w) = [1 - y^m w^T x^m]_+ \tag{2.1}$$

We introduce the expected hinge loss

$$\ell^{E}_{\mathrm{hinge}}(x^m, y^m; w, \mathcal{D}) = E_{n \sim \mathcal{D}} [1 - y^m w^T (x^m + n)]_+ \tag{2.2}$$

where $\mathcal{D}$ is a predefined noise distribution over the sample space. The optimization problem
for learning a classifier w.r.t. the expected hinge loss is thus

$$\min_{w} \sum_{m=1}^{M} \ell^{E}_{\mathrm{hinge}}(x^m, y^m; w, \mathcal{D}) \tag{2.3}$$

Granting an adversary the ability to choose the noise distribution, we end up with the
following formulation

$$\min_{w} \; \max_{\mathcal{D}_1 \times \mathcal{D}_2 \times \cdots \times \mathcal{D}_M \in \mathcal{C}_1 \times \mathcal{C}_2 \times \cdots \times \mathcal{C}_M} \; \sum_{m=1}^{M} \ell^{E}_{\mathrm{hinge}}(x^m, y^m; w, \mathcal{D}_m) \tag{2.4}$$

where $\mathcal{C}_m$ is the set of allowed noise distributions for the $m$-th sample. In order for the
adversarial optimization to be meaningful, all $\mathcal{C}_m$'s should have a 'bounded' nature. Since the
$m$-th summand depends only on $\mathcal{D}_m$, and each $\mathcal{D}_m$ is chosen independently of the others, we
may exchange the order of maximization and summation, and write

$$\min_{w} \sum_{m=1}^{M} \max_{\mathcal{D}_m \in \mathcal{C}_m} \ell^{E}_{\mathrm{hinge}}(x^m, y^m; w, \mathcal{D}_m) \tag{2.5}$$

Finally, we observe that the optimization task at hand is nothing but the minimization of the
effective loss function

$$\ell^{rob}_{\mathrm{hinge}}(x^m, y^m; w, \mathcal{C}) = \max_{\mathcal{D} \in \mathcal{C}} E_{n \sim \mathcal{D}} [1 - y^m w^T (x^m + n)]_+ \tag{2.6}$$

We refer to $\ell^{rob}_{\mathrm{hinge}}(x^m, y^m; w, \mathcal{C})$ as the expected robust hinge loss.
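The expected hinge loss can be approximated numerically for any noise distribution one can sample from. The following sketch is a Monte Carlo estimate under an assumed isotropic Gaussian noise, and it takes the maximum over a small, finite family of noise levels as a stand-in for the adversary. The function names, the finite family and the number of draws are assumptions made for illustration; they are not part of the framework itself.

```python
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def expected_hinge_mc(x, y, w, noise_std, num_draws=20000, seed=0):
    """Monte Carlo estimate of E_n [1 - y w^T (x + n)]_+ with n ~ N(0, noise_std^2 I)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_draws):
        noisy_x = [xi + rng.gauss(0.0, noise_std) for xi in x]
        total += max(0.0, 1.0 - y * dot(w, noisy_x))
    return total / num_draws


def robust_expected_hinge_mc(x, y, w, candidate_stds):
    """Worst case of the estimate over a finite family of noise levels."""
    return max(expected_hinge_mc(x, y, w, s) for s in candidate_stds)


# A sample that is classified correctly with margin 2, so its plain hinge loss is zero.
x, y, w = [2.0, 0.0], +1, [1.0, 0.0]
for s in (0.0, 0.5, 1.0):
    print(s, expected_hinge_mc(x, y, w, s))
print(robust_expected_hinge_mc(x, y, w, (0.0, 0.5, 1.0)))
```

Even though the noiseless loss of this sample is zero, the expected loss becomes positive once noise is added, because part of the cloud falls inside the margin.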

2. The adversarial Choice 



Equation 2.5 presents the general noise-robust formulation. In the following, we will derive
an explicit loss function for a specific collection of noise distributions. We focus on the case
in which the adversary is constrained to spread Gaussian noise with a trace-bounded
covariance matrix. The motivation behind this constraint is physical. When noise is
modeled by a distribution, its covariance is regarded as its power. Thus, by constraining
the sum of the eigenvalues of the covariance matrix we bound the power that the adversary
can spread. Gaussian noise is the worst-case noise, in the sense that among all
distributions with a given power bound it has the maximal entropy.

Using the notations of the previous section, we specify the restriction on the adversary
as

$$\mathcal{C} = \mathcal{C}_\beta = \{\mathcal{D} \sim \mathcal{N}(0, \Sigma) \mid \Sigma \in \mathcal{A}_\beta\}$$

where $\mathcal{A}_\beta = \{\Sigma \in \mathrm{PSD} \mid \mathrm{Tr}(\Sigma) \le \beta\}$, i.e. Gaussian distributions having the zero vector as
mean and a covariance matrix with a bounded sum of eigenvalues.

In the next couple of sections we will characterize the adversarial choice of the covariance
matrix and derive an explicit loss function.

2.1 The structure of the loss function 

The following paragraphs are rather technical. For later use, we explicitly perform the
integration of the robust hinge loss function. We then prove a monotonicity property of the
integrated loss. This property will help us analyze the nature of the adversarial choice in
our case. The key observation throughout the derivation is that the multivariate expectation
can be transformed into a univariate problem.



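We plug the notations that were introduced above into Equation 2.6 and get:

$$\ell^{rob}_{\mathrm{hinge}}(x^m, y^m; w, \mathcal{C}_\beta) = \max_{\Sigma \in \mathcal{A}_\beta} c|\Sigma|^{-\frac{1}{2}} \int e^{-\frac{1}{2} n^T \Sigma^{-1} n} [1 - y^m w^T (x^m + n)]_+ \, dn \tag{2.7}$$

where $c = (2\pi)^{-d/2}$ is the normalization constant. This is equivalent to:

$$\ell^{rob}_{\mathrm{hinge}}(x^m, y^m; w, \mathcal{C}_\beta) = \max_{\Sigma \in \mathcal{A}_\beta} c|\Sigma|^{-\frac{1}{2}} \int e^{-\frac{1}{2} n^T \Sigma^{-1} n} [1 - y^m w^T x^m - y^m w^T n]_+ \, dn \tag{2.8}$$

As a first step in the analysis of the expected robust hinge loss, we shall handle the
quantity

$$Q = c|\Sigma|^{-\frac{1}{2}} \int e^{-\frac{1}{2} n^T \Sigma^{-1} n} [1 - y^m w^T x^m - y^m w^T n]_+ \, dn \tag{2.9}$$

Note that the hinge term depends on $n$ only via the product $w^T n$. Therefore, we
define a new scalar variable $u = y^m w^T n$. Being a linear function of the Gaussian vector $n$,
$u$ is itself Gaussian, so Equation 2.9 can be viewed as the expected
value of $g(u) = [1 - y^m w^T x^m - u]_+$ under a univariate Gaussian. The moments of $u$ are

$$E[u] = E[y^m w^T n] = y^m w^T E[n] = 0$$

and

$$\mathrm{Var}[u] = \mathrm{Var}[y^m w^T n] = y^m w^T \mathrm{Var}[n] \, y^m w = (y^m)^2 w^T \Sigma w = w^T \Sigma w$$

Thus we get

$$Q = \frac{1}{\sqrt{2\pi w^T \Sigma w}} \int_{-\infty}^{\infty} e^{-\frac{u^2}{2 w^T \Sigma w}} [1 - y^m w^T x^m - u]_+ \, du \tag{2.10}$$

Define $\mathrm{erf}(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} \exp(-\frac{1}{2} z^2) \, dz$. Note that under this definition $\mathrm{erf}$ is the
cumulative distribution function of the standard normal distribution, rather than the
conventional error function. In addition, denote $\sigma^2(w, \Sigma) = w^T \Sigma w$. The
following proposition holds.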



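Proposition 1:

$$Q = (1 - y^m w^T x^m) \, \mathrm{erf}\left(\frac{1 - y^m w^T x^m}{\sigma(w, \Sigma)}\right) + \frac{\sigma(w, \Sigma)}{\sqrt{2\pi}} \exp\left(-\frac{(1 - y^m w^T x^m)^2}{2\sigma^2(w, \Sigma)}\right)$$

Proof: We conduct a direct computation:

$$\begin{aligned}
Q &= \frac{1}{\sqrt{2\pi\sigma^2(w, \Sigma)}} \int_{-\infty}^{\infty} e^{-\frac{u^2}{2\sigma^2(w, \Sigma)}} [1 - y^m w^T x^m - u]_+ \, du \\
  &= \frac{1}{\sqrt{2\pi\sigma^2(w, \Sigma)}} \int_{-\infty}^{1 - y^m w^T x^m} e^{-\frac{u^2}{2\sigma^2(w, \Sigma)}} (1 - y^m w^T x^m - u) \, du \\
  &= \frac{1 - y^m w^T x^m}{\sqrt{2\pi\sigma^2(w, \Sigma)}} \int_{-\infty}^{1 - y^m w^T x^m} e^{-\frac{u^2}{2\sigma^2(w, \Sigma)}} \, du - \frac{1}{\sqrt{2\pi\sigma^2(w, \Sigma)}} \int_{-\infty}^{1 - y^m w^T x^m} u \, e^{-\frac{u^2}{2\sigma^2(w, \Sigma)}} \, du
\end{aligned}$$

By using the variable substitution theorem and observing that the remaining integrand is an
odd function (thus the identity $\int_{-\infty}^{a} \mathrm{odd} = -\int_{a}^{\infty} \mathrm{odd}$ holds), we conclude that

$$Q = (1 - y^m w^T x^m) \, \mathrm{erf}\left(\frac{1 - y^m w^T x^m}{\sigma(w, \Sigma)}\right) + \frac{\sigma(w, \Sigma)}{\sqrt{2\pi}} \exp\left(-\frac{(1 - y^m w^T x^m)^2}{2\sigma^2(w, \Sigma)}\right) \tag{2.11}$$

∎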
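The closed form of Proposition 1 is easy to check numerically. Keep in mind that the function called erf in this chapter is the standard normal cumulative distribution function; it relates to the conventional error function, available in Python as `math.erf`, through cdf(t) = (1 + math.erf(t / sqrt(2))) / 2. The sketch below compares the closed form with a Monte Carlo estimate. The helper names, the diagonal covariance and the toy numbers are assumptions made for illustration.

```python
import math
import random


def normal_cdf(t):
    """Standard normal CDF (the 'erf' of this chapter), via math.erfc for tail accuracy."""
    return 0.5 * math.erfc(-t / math.sqrt(2.0))


def expected_hinge_closed_form(a, sigma):
    """Closed form of Q, with a = 1 - y w^T x and sigma^2 = w^T Sigma w > 0."""
    z = a / sigma
    return a * normal_cdf(z) + sigma / math.sqrt(2.0 * math.pi) * math.exp(-0.5 * z * z)


def expected_hinge_mc(x, y, w, variances, num_draws=50000, seed=0):
    """Monte Carlo estimate of the expected hinge loss with n ~ N(0, diag(variances))."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_draws):
        score = sum(wi * (xi + rng.gauss(0.0, math.sqrt(vi)))
                    for wi, xi, vi in zip(w, x, variances))
        total += max(0.0, 1.0 - y * score)
    return total / num_draws


x, y, w = [1.0, -0.5], +1, [0.8, 0.6]
variances = [0.5, 1.5]  # diagonal of an assumed covariance matrix Sigma

a = 1.0 - y * sum(wi * xi for wi, xi in zip(w, x))
sigma = math.sqrt(sum(wi * wi * vi for wi, vi in zip(w, variances)))  # sqrt(w^T Sigma w)

print(expected_hinge_closed_form(a, sigma), expected_hinge_mc(x, y, w, variances))
```

The two printed numbers should agree up to the sampling error of the Monte Carlo estimate, which reflects the reduction of the multivariate expectation to a univariate one.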



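Let us establish the following simple property of $Q$.

Lemma 2 $Q$ is monotone-increasing in $\sigma^2$.

Proof: The fundamental theorem of calculus yields that

$$\frac{d}{dt} \mathrm{erf}(t) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{t^2}{2}\right) \tag{2.12}$$

Using the chain rule we compute

$$\frac{dQ}{d\sigma^2} = \frac{1}{2\sqrt{2\pi}\sqrt{\sigma^2(w, \Sigma)}} \exp\left(-\frac{(1 - y^m w^T x^m)^2}{2\sigma^2(w, \Sigma)}\right) \tag{2.13}$$

Here the two contributions that arise from differentiating the arguments of $\mathrm{erf}$ and $\exp$
cancel each other, so that only the derivative of the prefactor $\sigma(w, \Sigma)$ remains.
It is evident that for all $\sigma^2 > 0$

$$\frac{dQ}{d\sigma^2} > 0 \tag{2.14}$$

i.e. $Q$ is monotone-increasing in $\sigma^2$.
∎

2.2 The optimal covariance matrix subject to a trace constraint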

We will now focus on finding the optimal adversary, i.e., performing the maximization of
Equation 2.8 over the range of allowed covariance matrices. The next theorem specifies
which covariance matrix attains the worst-case loss. In our terminology, we refer to this result
as the adversarial choice.

Theorem 2.1: The optimal $\Sigma$ in Equation 2.8 is given by $\Sigma^* = \beta \frac{w w^T}{\|w\|^2}$, where the
optimization is done over $\Sigma \in \mathcal{A}_\beta$.

Before actually proving the theorem, we will give some geometric intuition. The idea
behind the expected loss is to replace the original sample point with a Gaussian cloud
centered at the original point (Figure 2.1). Consider an arbitrary displacement $\tilde{x}^m = x^m + n$.
For fixed $w$, $n$ can be written as $n = n_\parallel + n_\perp$, where $n_\parallel$ is parallel to $w$ and
$n_\perp$ is orthogonal to it. The relevant quantity is
$w^T \tilde{x}^m = w^T x^m + w^T n_\parallel$, that is, the orthogonal component does not have any effect.
Accordingly, it makes sense that the optimal noise direction is orthogonal to the separating
hyperplane, i.e. parallel to the vector $w$ (see Figure 2.1).

Figure 2.1: (a) Replacing the sample point with a Gaussian cloud. (b) The optimal noise direction
is orthogonal to the separating hyperplane. (c) The expected robust hinge loss only considers the
tail of the distribution, i.e. the points that suffer a margin error.



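The proof of Theorem 2.1 applies simple algebraic results to establish this result rigorously.

Proof: Plugging Proposition 1 into Equation 2.8 we get

$$\max_{\Sigma \in \mathcal{A}_\beta} \left[ (1 - y^m w^T x^m) \, \mathrm{erf}\left(\frac{1 - y^m w^T x^m}{\sigma(w, \Sigma)}\right) + \frac{\sigma(w, \Sigma)}{\sqrt{2\pi}} \exp\left(-\frac{(1 - y^m w^T x^m)^2}{2\sigma^2(w, \Sigma)}\right) \right]$$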



The above depends on $\Sigma$ only via $\sigma^2(w, \Sigma)$. According to Lemma 2, the objective
is monotone increasing in $\sigma^2$. Therefore, the adversary would like to choose $\Sigma$ so that
$\sigma^2(w, \Sigma)$ is maximized. By applying the Cauchy-Schwarz inequality, we conclude that
$\sigma^2(w, \Sigma)$ is at most $\lambda_{\max}(\Sigma) \|w\|^2$. For all $\Sigma \in \mathcal{A}_\beta$ it holds that $\mathrm{Tr}(\Sigma) \le \beta$.
Since all of the eigenvalues are nonnegative, it holds that $\lambda_{\max} \le \beta$ as well. Consider the
candidate solution $\Sigma^* = \beta \frac{w w^T}{\|w\|^2}$. Since $\sigma^2(w, \Sigma^*) = \beta\|w\|^2$, this selection attains the
maximum. Note that this covariance matrix reflects the fact that the adversarial choice
is to spread the noise parallel to $w$, i.e. perpendicular to the separating hyperplane. ∎
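Theorem 2.1 can be sanity-checked numerically: the rank-one matrix proportional to the outer product of w with itself has trace equal to the bound and attains the largest value of the quadratic form, while other positive semidefinite matrices with the same trace do not exceed it. The sketch below uses plain Python lists. The way random feasible matrices are generated (a random matrix times its transpose, rescaled to the trace bound) and all names are assumptions chosen for illustration.

```python
import random


def quad_form(w, S):
    """w^T S w for a square matrix S given as a list of rows."""
    d = len(w)
    return sum(w[i] * S[i][j] * w[j] for i in range(d) for j in range(d))


def trace(S):
    return sum(S[i][i] for i in range(len(S)))


def adversarial_covariance(w, beta):
    """Candidate optimum beta * w w^T / ||w||^2."""
    norm_sq = sum(wi * wi for wi in w)
    return [[beta * wi * wj / norm_sq for wj in w] for wi in w]


def random_feasible_covariance(d, beta, rng):
    """A random PSD matrix A A^T, rescaled so that its trace equals beta."""
    A = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]
    S = [[sum(A[i][k] * A[j][k] for k in range(d)) for j in range(d)] for i in range(d)]
    scale = beta / trace(S)
    return [[scale * s for s in row] for row in S]


w, beta = [3.0, 4.0, 0.0], 2.0
sigma_star = adversarial_covariance(w, beta)
best = quad_form(w, sigma_star)  # beta * ||w||^2, up to rounding

rng = random.Random(0)
others = [quad_form(w, random_feasible_covariance(3, beta, rng)) for _ in range(1000)]
print(trace(sigma_star), best, max(others))
```

In this toy run the norm of w is 5, so the attained value is 2 times 25, and every randomly drawn feasible covariance yields a smaller or equal quadratic form.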



3. A smooth loss function 

In the previous sections we carried out the technical computations needed in order to derive
the robust hinge loss explicitly, and found the optimal covariance matrix. In the following
we will put these results together, and present an explicit formulation of the loss function
resulting from our model. In addition, it is shown that our robust loss can be represented
as a perspective of a scalar smooth approximation of the hinge loss. By analyzing this
function we are able to gain a better understanding of $\ell^{rob}_{\mathrm{hinge}}$. We conclude this section by
showing that our loss function is a smooth convex upper-approximation of the hinge loss.
When the 'diameter' of the noise cloud is shrunk, $\ell^{rob}_{\mathrm{hinge}}$ coincides with the hinge loss.
We devote a notation for the result of Proposition 1

$$\ell^{E}_{\mathrm{hinge}}(x^m, y^m; w, \Sigma) = (1 - y^m w^T x^m) \, \mathrm{erf}\left(\frac{1 - y^m w^T x^m}{\sigma(w, \Sigma)}\right) + \frac{\sigma(w, \Sigma)}{\sqrt{2\pi}} \exp\left(-\frac{(1 - y^m w^T x^m)^2}{2\sigma^2(w, \Sigma)}\right)$$

By combining the above equation with the result of Theorem 2.1 we conclude

$$\ell^{rob}_{\mathrm{hinge}}(x^m, y^m; w, \beta) = (1 - y^m w^T x^m) \, \mathrm{erf}\left(\frac{1 - y^m w^T x^m}{\sqrt{\beta}\|w\|}\right) + \frac{\sqrt{\beta}\|w\|}{\sqrt{2\pi}} \exp\left(-\frac{(1 - y^m w^T x^m)^2}{2\beta\|w\|^2}\right)$$

$\beta$ has the meaning of statistical variance, and therefore in the following we will replace it
with $\sigma^2$ (not to be confused with $\sigma^2(w, \Sigma)$). In order to understand the nature of the loss
function we have defined, it is instructive to define

$$f(z) = z \, \mathrm{erf}(z) + \frac{1}{\sqrt{2\pi}} e^{-\frac{z^2}{2}} \tag{2.15}$$

Using $f$, the robust expected hinge loss can be written as

$$\ell^{rob}_{\mathrm{hinge}}(x^m, y^m; w, \sigma^2) = \sigma\|w\| \, f\left(\frac{1 - y^m w^T x^m}{\sigma\|w\|}\right) \tag{2.16}$$

A direct computation shows that

$$\frac{df}{dz} = \mathrm{erf}(z) + \frac{z}{\sqrt{2\pi}} e^{-\frac{z^2}{2}} - \frac{z}{\sqrt{2\pi}} e^{-\frac{z^2}{2}} = \mathrm{erf}(z) \tag{2.17}$$

We are now ready to prove a simple yet fundamental property of $f$.

Theorem 3.1: $f$ is a smooth strictly-convex upper-approximation of the hinge loss.

Figure 2.2: The function $f$ is a smooth approximation of the hinge loss

Proof: Denote the hinge loss $h(z) = [z]_+$. We must show that
1. $f$ is strictly convex

2. $f(z) \ge h(z)$

3. $\lim_{z \to -\infty} f(z) - h(z) = 0$

4. $\lim_{z \to \infty} f(z) - h(z) = 0$
Differentiating Equation 2.17 once again, we get

$$\frac{d^2 f}{dz^2} = \frac{1}{\sqrt{2\pi}} e^{-\frac{z^2}{2}} \tag{2.18}$$

which is clearly positive for all $z$. Thus, $f$ is strictly convex.
For the upper bound property, notice that $f(z) > 0$ for all $z \in \mathbb{R}$, since $f$ is increasing
by Equation 2.17 and, as shown below, tends to 0 as $z \to -\infty$. Hence, for $z < 0$ we
simply have $f(z) > 0 = h(z)$. For the complementary case, denote the difference function
over the positives $\delta(z) = f(z) - h(z) = z \, \mathrm{erf}(z) + \frac{1}{\sqrt{2\pi}} e^{-\frac{z^2}{2}} - z$. Using Equation 2.17 we
obtain

$$\frac{d\delta}{dz} = \mathrm{erf}(z) - 1 \tag{2.19}$$

It can be easily seen that $\frac{d\delta}{dz} < 0$, i.e. $\delta$ is monotone decreasing. Observe that $\delta(0) = \frac{1}{\sqrt{2\pi}}$
and $\lim_{z \to \infty} \delta(z) = 0$. Since all the functions involved are continuous, we conclude that for
$z > 0$ it holds that $h(z) < f(z) < h(z) + \frac{1}{\sqrt{2\pi}}$. Altogether, we have established the upper
bound property.

For the asymptote at $z \to -\infty$, observe that from l'Hôpital

$$\lim_{z \to -\infty} z \, \mathrm{erf}(z) = \lim_{z \to -\infty} \frac{\mathrm{erf}(z)}{1/z} = \lim_{z \to -\infty} -\frac{z^2}{\sqrt{2\pi}} e^{-\frac{z^2}{2}} = 0$$

Since the exponent in the right summand of $f$ decays as well, we have that at $z \to -\infty$
both $f(z)$ and $h(z)$ tend to 0.

For the asymptote at $z \to \infty$, we must show that $f$ asymptotically coincides with the
linear function $z$. To this end, let us write $f(z) - z = z \, (\mathrm{erf}(z) - 1) + \frac{1}{\sqrt{2\pi}} e^{-\frac{z^2}{2}}$. Applying
l'Hôpital's rule along with the asymptotic behavior of the exponent, we deduce that
$\lim_{z \to \infty} f(z) - z = 0$, as desired. ∎
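The properties established in this proof are easy to inspect numerically. The sketch below implements f with the standard normal CDF playing the role of the chapter's erf (computed here with `math.erfc` for accuracy in the tails) and prints the gap between f and the hinge at a few points. The function names and the evaluation points are illustrative assumptions.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, i.e. the function called erf in this chapter."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def f(z):
    """Smooth approximation of the hinge: z * cdf(z) + pdf(z)."""
    return z * normal_cdf(z) + math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def h(z):
    """Scalar hinge [z]_+ ."""
    return max(z, 0.0)


for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(z, f(z), h(z), f(z) - h(z))
```

The gap is largest at z = 0, where it equals 1/sqrt(2*pi), roughly 0.399, and it shrinks towards zero in both directions, in line with the two asymptotes.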



Next, we will analyze the relation between $f$ and $\ell^{rob}_{\mathrm{hinge}}$.

Definition 3 Perspective of a function (from Boyd and Vandenberghe [2004], 3.2.6).
If $f : \mathbb{R}^n \to \mathbb{R}$, then the perspective of $f$ is the function $g : \mathbb{R}^{n+1} \to \mathbb{R}$ defined by

$$g(t, x) = t f\left(\frac{x}{t}\right)$$

with domain

$$\mathrm{dom}(g) = \left\{(t, x) \;\middle|\; \frac{x}{t} \in \mathrm{dom}(f), \; t > 0\right\}$$

The following lemma is useful. For a proof see, e.g., Boyd and Vandenberghe [2004], 3.2.6.
Lemma 4 If $f$ is convex (concave), then its perspective is convex (concave) as well.
Define the function

$$g(a, b) = a f\left(\frac{b}{a}\right) \tag{2.20}$$



Lemma 4 implies that $g(\sigma\|w\|, 1 - y^m w^T x^m)$ is jointly convex in both its arguments.
In order to establish the strict convexity of $\ell^{rob}_{\mathrm{hinge}}$ in $w$, we need a more powerful tool.
Consider the following lemma (Boyd and Vandenberghe [2004], 3.2.4)
Lemma 5 Let $h : \mathbb{R}^k \to \mathbb{R}$, $g_i : \mathbb{R}^n \to \mathbb{R}$. Consider the function $f(x) = h(g(x)) =
h(g_1(x), g_2(x), \ldots, g_k(x))$. Then, $f$ is convex if $h$ is convex, $h$ is nondecreasing in each
argument, and $g_i$ are convex.

This lemma can be easily generalized to the case of strictly-convex functions. The proof is
identical to that of the original version, and is thus skipped.
We are now ready to prove the following theorem 

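Theorem 3.2: $\ell^{rob}_{\mathrm{hinge}}$ is strictly convex in $w$.

Proof: From Lemma 4, $g$ is convex. In addition, $g$ is nondecreasing in each of its arguments.
To see that, observe that

$$\frac{\partial g}{\partial a} = f\left(\frac{b}{a}\right) - \frac{b}{a} f'\left(\frac{b}{a}\right) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{b^2}{2a^2}\right)$$

$$\frac{\partial g}{\partial b} = f'\left(\frac{b}{a}\right) = \mathrm{erf}\left(\frac{b}{a}\right)$$

which are both strictly positive.

$\sigma\|w\|$ and $(1 - y^m w^T x^m)$ are both convex in $w$, thus we conclude by applying Lemma 5.
∎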



The next theorem explores some of the other properties of the loss function we have
defined:

Theorem 3.3: $\ell^{rob}_{hinge}$ is an upper approximation to the hinge loss. Furthermore, when $\sigma^2 \to 0$,
the loss function coincides with the hinge loss.

Proof: For the upper bound property, we apply Theorem 3.1

\[
\begin{aligned}
\ell^{rob}_{hinge}(x^i, y^i; w, \sigma^2)
&= \sigma\|w\|\, f\left(\frac{1 - y^i w^T x^i}{\sigma\|w\|}\right)
\ge \sigma\|w\|\, h\left(\frac{1 - y^i w^T x^i}{\sigma\|w\|}\right) \\
&= \sigma\|w\| \left[\frac{1 - y^i w^T x^i}{\sigma\|w\|}\right]_+ \\
&= \left[1 - y^i w^T x^i\right]_+
\end{aligned}
\]

For the second part of the theorem, let us first observe that

\[
\frac{\sigma\|w\|}{\sqrt{2\pi}} \exp\left(-\frac{(1 - y^m w^T x^m)^2}{2\sigma^2\|w\|^2}\right) \to 0 \tag{2.21}
\]

as a multiplication of two vanishing factors at $\sigma \to 0$. We consider two cases.

1. $1 - y^m w^T x^m > 0$. Observe that

\[
\mathrm{erf}\left(\frac{1 - y^m w^T x^m}{\sigma\|w\|}\right) \to \mathrm{erf}(\infty) = 1
\]

Thus, $\ell^{rob}_{hinge}(x^m, y^m; w, \sigma^2) \to 1 - y^m w^T x^m$.

2. $1 - y^m w^T x^m < 0$. In this case

\[
\mathrm{erf}\left(\frac{1 - y^m w^T x^m}{\sigma\|w\|}\right) \to \mathrm{erf}(-\infty) = 0
\]

Thus, $\ell^{rob}_{hinge}(x^m, y^m; w, \sigma^2) \to 0$.

Altogether, we have shown that when $\sigma \to 0$, $\ell^{rob}_{hinge}(x^m, y^m; w, \sigma^2) \to \left[1 - y^m w^T x^m\right]_+$. ∎

Observe that at $\sigma = 0$ the loss function is not continuous. The discontinuity is removable,
however, so this issue does not pose any problem.
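
Both claims of Theorem 3.3 can be illustrated with a few lines of code. This is a minimal sketch, again assuming that erf stands for the standard normal CDF; the function names and the explicit fallback to the plain hinge when $\sigma\|w\| = 0$ are illustrative choices, not part of the thesis.

```python
import math


def gauss_cdf(z):
    """Standard normal CDF, assumed to be the function the text denotes erf."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def hinge(gap):
    """Ordinary hinge loss as a function of gap = 1 - y * w^T x."""
    return max(0.0, gap)


def robust_hinge(gap, sigma, w_norm):
    """Gaussian-smoothed hinge loss for gap = 1 - y * w^T x.

    Falls back to the plain hinge when sigma * ||w|| == 0, which fills in
    the removable discontinuity mentioned in the text (an assumption of
    this sketch).
    """
    scale = sigma * w_norm
    if scale == 0.0:
        return hinge(gap)
    q = gap / scale
    density = math.exp(-0.5 * q * q) / math.sqrt(2.0 * math.pi)
    return gap * gauss_cdf(q) + scale * density


if __name__ == "__main__":
    # The smooth loss approaches the hinge from above as sigma shrinks.
    for sigma in (2.0, 1.0, 0.1, 0.001):
        values = [robust_hinge(gap, sigma, 1.0) for gap in (-1.0, 0.0, 1.0)]
        print(sigma, [round(v, 6) for v in values])
```

Since $\sigma$ and $\|w\|$ only enter through their product, shrinking either of them tightens the approximation in the same way.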




Figure 2.3: $\ell^{rob}_{hinge}$ is a convex upper approximation to the hinge loss. As $\sigma$ tends to 0, $\ell^{rob}_{hinge}$ tends
to the hinge. In all of the graphs, the norm was set to 1.

The norm of the classifier $\|w\|$ always appears in a multiplication with $\sigma$. Thus, we
observe that it has a similar function: it controls the tightness of the approximation
of the smooth loss function to the hinge. Since $\sigma$ is pre-determined, the optimal norm
should reflect some kind of compensation. We thus conjecture that there exists an inverse
relation between $\sigma$ and the optimal norm (cf. Chapter 4).

Finally, it should be noted that this smooth loss function can be viewed as a multiplicative
regularized loss.



4. GURU: a primal algorithm 

We are now ready to devise an algorithm that solves our learning problem. In this section
we describe a stochastic gradient descent (SGD) method that minimizes the strictly convex
loss function at hand. A convergence result for the algorithm stems from general properties
of SGD that were studied extensively (see, e.g., Shalev-Shwartz et al. [2007b]; Kivinen et al.
[2003]; Zhang et al. [2003]; Nedic and Bertsekas [2000]; Bottou and Bousquet [2008]).

Plugging the robust hinge loss function we have derived (Equation 2.15) into the original
optimization task (Equation 2.5), we get

\[
\min_w \sum_{m=1}^{M} \left[ (1 - y^m w^T x^m)\,\mathrm{erf}\left(\frac{1 - y^m w^T x^m}{\sigma\|w\|}\right)
+ \frac{\sigma\|w\|}{\sqrt{2\pi}} \exp\left(-\frac{(1 - y^m w^T x^m)^2}{2\sigma^2\|w\|^2}\right) \right] \tag{2.22}
\]

This formulation is a convex unconstrained minimization task. One very natural approach
for solving this kind of task is using the family of gradient descent methods. Denote the
objective of the optimization as

\[
G(w) = \sum_i g_i(w) \tag{2.23}
\]

In batch gradient descent, in each step the algorithm updates

\[
w \leftarrow w - \eta \nabla G(w) \tag{2.24}
\]

In stochastic gradient methods the gradient is approximated as the gradient of one of the
summands. Thus, the algorithm first randomizes an index $i$, then updates

\[
w \leftarrow w - \eta \nabla g_i(w) \tag{2.25}
\]



where $\eta$ is the learning rate. The stochastic version suits settings of online learning, in
which the learner is presented with one training sample at a time. It has been suggested that using
the stochastic version yields better generalization performance in learning tasks (Amari
[1998]; Bottou and LeCun [2003]).

Our algorithm, named GURU (GaUssian RobUst), optimizes Equation 2.22 using an
SGD procedure. (For a full treatment see, e.g. Boyd (ref).)

In order to derive the update formula, one should first calculate the gradient of the loss
function. A straightforward computation yields

\[
\nabla_w \ell^{rob}_{hinge}(x^i, y^i; w, \sigma^2)
= -y^i x^i\,\mathrm{erf}\left(\frac{1 - y^i w^T x^i}{\sigma\|w\|}\right)
+ \frac{\sigma w}{\sqrt{2\pi}\,\|w\|} \exp\left(-\frac{(1 - y^i w^T x^i)^2}{2\sigma^2\|w\|^2}\right) \tag{2.26}
\]

Name             #Training samples   #Cross-validation samples   #Test samples   #Features
Toy (a)          200                 200                         200             2
Toy (b)          200                 200                         200             2
Ionosphere       100                 100                         152             34
diabetes         200                 100                         468             8
splice 1 vs. 2   500                 400                         635             60
USPS 3 vs. 5     800                 700                         700             256
USPS 5 vs. 8     800                 700                         700             256
USPS 7 vs. 9     800                 700                         700             256

Table 2.1: Description of the databases used in the binary case



We therefore suggest the following SGD procedure.

Algorithm 1: GURU($S$, $\eta_0$, $\epsilon$)

Data: Training set $S$, learning rate $\eta_0$, accuracy $\epsilon$
Result: $w$

$w \leftarrow 0$;
while $\Delta L > \epsilon$ do
    $m \leftarrow \mathrm{rand}(M)$;
    $w \leftarrow w - \frac{\eta_0}{\sqrt{t}} \nabla_w \ell^{rob}_{hinge}(x^m, y^m; w, \sigma^2)$;
end
return $w$;

For convergence results see Nedic and Bertsekas [2000]. For a full treatment, see
Bertsekas et al. [2003], chapter 8.
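
The procedure above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the implementation used for the experiments: erf is taken to be the standard normal CDF, the weights start from a small nonzero vector (the gradient in Equation 2.26 divides by $\|w\|$, so $w = 0$ cannot be plugged in directly), and the stopping rule $\Delta L > \epsilon$ is replaced by a fixed number of steps. All function names are hypothetical.

```python
import math
import random


def gauss_cdf(z):
    """Standard normal CDF, assumed to be the function the text denotes erf."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def gauss_pdf(z):
    """Standard normal density exp(-z^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def robust_hinge_loss(w, x, y, sigma):
    """Single-sample summand of Equation 2.22."""
    scale = sigma * math.sqrt(dot(w, w))
    gap = 1.0 - y * dot(w, x)
    q = gap / scale
    return gap * gauss_cdf(q) + scale * gauss_pdf(q)


def robust_hinge_gradient(w, x, y, sigma):
    """Gradient of the single-sample loss with respect to w (Equation 2.26)."""
    norm = math.sqrt(dot(w, w))
    q = (1.0 - y * dot(w, x)) / (sigma * norm)
    cdf, pdf = gauss_cdf(q), gauss_pdf(q)
    return [-y * xi * cdf + sigma * wi / norm * pdf for xi, wi in zip(x, w)]


def guru_sgd(samples, sigma, eta0, n_steps, seed=0):
    """SGD on the robust hinge loss with step size eta0 / sqrt(t).

    `samples` is a list of (x, y) pairs with y in {-1, +1}.
    Assumptions of this sketch: small nonzero start, fixed step budget.
    """
    rng = random.Random(seed)
    dim = len(samples[0][0])
    w = [1e-3] * dim  # assumption: avoid dividing by ||w|| = 0
    for t in range(1, n_steps + 1):
        x, y = rng.choice(samples)
        grad = robust_hinge_gradient(w, x, y, sigma)
        step = eta0 / math.sqrt(t)
        w = [wi - step * gi for wi, gi in zip(w, grad)]
    return w


if __name__ == "__main__":
    data = [([2.0, 0.0], 1), ([3.0, 1.0], 1), ([-2.0, 0.0], -1), ([-3.0, -1.0], -1)]
    print(guru_sgd(data, sigma=0.5, eta0=0.5, n_steps=2000))
```

The loss function is included only so that the gradient can be compared against finite differences; the SGD loop itself needs the gradient alone.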



5. Experiments 

In this section we present experimental results that demonstrate the fact that GURU
generalizes as well as SVM. Experiments were carried out on two toy problems (see Figure 2.4
for a visualization), USPS handwritten digits classification (3 vs. 5, 5 vs. 8 and 7 vs. 9)
and a couple of UCI databases (Frank and Asuncion [2010]). The sizes of the data sets are
detailed in Table 2.1.

Name             GURU(%)   SVM(%)
Toy (a)          92.5      92.5
Toy (b)          92        92
Ionosphere       82.24     79.61
diabetes         67.52     67.31
splice 1 vs. 2   93.39     92.44
USPS 3 vs. 5     95.57     95.86
USPS 5 vs. 8     97.71     98
USPS 7 vs. 9     97.57     97.43

Table 2.2: Summary of the results: GURU and SVM.



We have trained GURU for $\sigma$ values varying from $2^{-20}$ to $2^{20}$, with exponential jumps.
The learning rate was tuned empirically (values between $4^{-10}$ and $4^{10}$ were tested). SVM was
trained and tested using the SVM-light package. $\lambda$ values between $4^{-15}$ and $4^{15}$ were used.
Note that in the SVM-light formulation, $\lambda$ multiplies the loss and not the regularization
term. Thus, the qualitative relation between $\lambda$ and $\sigma$ is roughly $\sigma \sim \frac{1}{\lambda}$. Parameter selection
was done based on the cross-validation set, and performance was evaluated for the optimal
parameter on a testing set. The results are summarized in Table 2.2.



On the toy databases (a)-(b), the performance of GURU is identical to that of SVM.
We have tested the learned classifiers' resistance to noise, by adding uniformly distributed
random noise to both cross-validation and test sets. The results are presented in Figure 2.5.
Observe that the resistance of GURU slightly outperforms that of SVM. Nonetheless, this
result gives experimental support to the theoretical result in Xu et al. [2009], where it
was shown that the ordinary SVM formulation is equivalent to a robust formulation, in
which the adversary is capable of displacing the data samples.

On the Ionosphere database, GURU significantly outperforms SVM. The samples of
this database consist of radar readings. Thus, GURU's performance may be explained by
the noisy nature of the samples. This finding supports the intuition that GURU performs
well in noisy setups.

On USPS, the performance of GURU is quite similar to that of SVM. Since the samples
can be easily visualized as images, it is convenient to examine the adversarial action in this
case. Consider Figure 2.6. The GURU adversary is symmetric, in the sense that it may
move the samples either closer to or further from the separating hyperplane. Hence, some
digits look even clearer than the original ones, whereas others look like the opponent
digit.

Figure 2.4: (a) Gaussian data, (b) Narrow Gaussian with outliers.

Figure 2.5: Classifiers' performance on noised cross-validation and testing sets. The x-axis
indicates the magnitude of the noise (noise distributes as $U(-x, x)$). The experiment was repeated 50
times. (a)-(b) represent the respective toy problems.

Figure 2.6: The GURU adversary adds noise perpendicularly to the separating hyperplane. Note
that some samples are even clearer than the original, whereas others look like the opponent
digit. A bunch of samples are a superposition of both. (a)-(c) are the original digits, (d)-(f) are 25
noisy samples.

Figure 2.7: Classifiers' performance on noised cross-validation and testing sets. The x-axis
indicates the magnitude of the noise (noise distributes as $U(-x, x)$). The experiment was repeated 10
times. (a) 3 vs 5, (b) 5 vs 8, (c) 7 vs 9.



In addition, on the USPS dataset, GURU has demonstrated a significantly better
resistance to noise than SVM (see Figure 2.7).

Chapter 3

Dual formulation

In this chapter we derive a dual problem for the learning task at hand. We do not use the dual
as a means to solve the primal problem, since the primal optimization works well. Rather,
we use it to gain a better understanding of the problem. In the course of the derivation we
use the notion of conjugate functions. We will show that the dual problem itself specifies
the classifier up to a scaling factor. Thus, we devise a method to extract the norm using
the available information. It is interesting to observe that throughout the derivation of
the dual, the smooth function $f$ plays a specific and distinguished role. Thus, the entire
procedure may be applied as is for other smooth convex functions, by only calculating their
conjugate dual. We demonstrate this principle in Section 2, where we also discuss the
relation between the primal loss and the dual formulation.



1. Mathematical Derivation 

This section is rather technical, and goes through the derivation of the dual. We start with
the perspective representation of $\ell^{rob}_{hinge}$, and introduce a couple of auxiliary variables. Using
these variables, the Lagrangian takes a form that we are able to analyze. Theorem 1.1
encapsulates the effect of $f$, in such a manner that other loss functions can be plugged into
the derivation rather easily.

The main result of this section is that the dual form of Equation 2.22 is

\[
\begin{aligned}
\max_{\alpha}\quad & \sum_m \alpha_m \\
\text{s.t.}\quad & \Big\|\sum_m \alpha_m y^m x^m\Big\| \le \sigma \sum_m \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\mathrm{erf}_{inv}^2(\alpha_m)}{2}\right) \\
& \alpha \ge 0
\end{aligned} \tag{3.1}
\]

In the following paragraphs we will go through the details.



The optimization task Equation 2.22 may be written as

\[
\min_w \sum_m \sigma\|w\|\, f\left(\frac{1 - y^m w^T x^m}{\sigma\|w\|}\right) \tag{3.2}
\]

We introduce the auxiliary variables $z_m$, and constrain them with $1 - y^m w^T x^m \le z_m$.
Note that $f(z)$ is monotone increasing in $z$. Thus, at optimality $z_m = 1 - y^m w^T x^m$. In
addition, we introduce the variable $r$ and constrain it with $\sigma\|w\| \le r$. At the optimum
$r = \sigma\|w\|$, since $r f\left(\frac{\cdot}{r}\right)$ is monotone increasing in $r$. Altogether we get the following
optimization task

\[
\begin{aligned}
\min_{w, z, r}\quad & \sum_m r f\left(\frac{z_m}{r}\right) \\
\text{s.t.}\quad & \sigma\|w\| \le r \\
& 1 - y^m w^T x^m \le z_m \\
& r \ge 0
\end{aligned} \tag{3.3}
\]

where the optimization variables are $w, z_1, \dots, z_M, r$.

The objective is convex according to Lemma 4, and the constraint on $w$ is a second
order cone.

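To find the dual, write the Lagrangian:

\[
\mathcal{L}(w, r, z, \alpha, \lambda) = \sum_m r f\left(\frac{z_m}{r}\right) + \lambda\left[\sigma\|w\| - r\right] + \sum_m \alpha_m\left[1 - y^m w^T x^m - z_m\right] - \mu r
\]

where $\lambda, \alpha_m, \mu \ge 0$ are the Lagrange multipliers. For later convenience we add a set of
variables $r_m$ and force them all to equal $r$. So the new Lagrangian is:

\[
\begin{aligned}
\mathcal{L}(w, r, z, \alpha, \lambda, \delta) = {} & \sum_m r_m f\left(\frac{z_m}{r_m}\right) + \lambda\left[\sigma\|w\| - r\right] \\
& + \sum_m \alpha_m\left[1 - y^m w^T x^m - z_m\right] - \mu r - \sum_m \delta_m\left[r_m - r\right]
\end{aligned}
\]

where $\delta_m \ge 0$.

Recall that we have defined $g(a, b) = a f\left(\frac{b}{a}\right)$. Using this notion we get the following
task

\[
\begin{aligned}
\min_{w, r, z, \{r_m\}} \mathcal{L}(w, r, z, \alpha, \lambda, \delta)
= {} & \min_{w, r} \sum_m \min_{z_m, r_m} \left[ g(r_m, z_m) - \alpha_m z_m - \delta_m r_m \right] \\
& + \lambda\left[\sigma\|w\| - r\right] + \sum_m \alpha_m\left[1 - y^m w^T x^m\right] - \mu r + r \sum_m \delta_m \\
= {} & \min_{w, r} \sum_m g^*(\alpha_m; \delta_m) + \lambda\left[\sigma\|w\| - r\right] \\
& + \sum_m \alpha_m\left[1 - y^m w^T x^m\right] - \mu r + r \sum_m \delta_m
\end{aligned}
\]

where $g^*$ is by definition the conjugate function of $g$ (for details see, e.g., Boyd and
Vandenberghe [2004], 3.3). Deriving the Lagrangian w.r.t. $w$ gives:

\[
\sigma\lambda \frac{w}{\|w\|} = \sum_m \alpha_m y^m x^m \tag{3.4}
\]

Taking the norm of both sides of the equation yields

\[
\sigma\lambda(\alpha) = \Big\|\sum_m \alpha_m y^m x^m\Big\| \tag{3.5}
\]

Substituting this back into the objective, the terms with $w$ cancel out and we have:

\[
\min_r \sum_m g^*(\alpha_m; \delta_m) + \sum_m \alpha_m - r\lambda(\alpha) - \mu r + r \sum_m \delta_m \tag{3.6}
\]

This is linear in $r$, thus deriving w.r.t. $r$ yields a constraint $\sum_m \delta_m = \lambda(\alpha) + \mu$. Since
$\mu \ge 0$, the equality constraint may be relaxed to $\sum_m \delta_m \ge \lambda(\alpha)$, and we end up with the
following formulation

\[
\begin{aligned}
\max\quad & \sum_m \alpha_m + \sum_m g^*(\alpha_m, \delta_m) \\
\text{s.t.}\quad & \sum_m \delta_m \ge \lambda(\alpha) \\
& \alpha \ge 0
\end{aligned} \tag{3.7}
\]

Or:

\[
\begin{aligned}
\max\quad & \sum_m \alpha_m + \sum_m g^*(\alpha_m, \delta_m) \\
\text{s.t.}\quad & \Big\|\sum_m \alpha_m y^m x^m\Big\| \le \sigma \sum_m \delta_m \\
& \alpha \ge 0
\end{aligned} \tag{3.8}
\]

The overall problem has a concave objective (since it is a conjugate dual of a convex
function) and second order cone constraints. In what follows we work out the form of the
conjugate dual $g^*$.

Denote by $f^*(\alpha)$ the conjugate function of $f$ (it is concave). The next theorem specifies
the conjugate $g^*$ in terms of $f^*$:

Theorem 1.1: The conjugate dual of $g(a, b)$ is

\[
g^*(\alpha; \delta) =
\begin{cases}
0 & f^*(\alpha) \ge \delta \\
-\infty & \text{otherwise}
\end{cases} \tag{3.9}
\]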



Proof: We must calculate

\[
g^*(\alpha; \delta) = \min_{x, t} \left( t f\left(\frac{x}{t}\right) - \alpha x - \delta t \right) \tag{3.10}
\]

To prove, we change from the variables $x, t$ to the variables $z = x/t$, $t$:

\[
\min_{t > 0, z}\ t f(z) - \alpha z t - \delta t = \min_{t > 0, z}\ t\left(f(z) - \alpha z - \delta\right) \tag{3.11}
\]

For the first case, assume that $f^*(\alpha) \ge \delta$, which implies that for all $z$:

\[
f(z) - \alpha z \ge \delta \tag{3.12}
\]

Then in Equation 3.11 the minimization is always of the product of $t > 0$ and some
non-negative number. Hence it is always non-negative, and zero can be attained at the limit
$t \to 0$.

On the other hand, if $f^*(\alpha) < \delta$, we will show that there exists a pair $t, z$ that achieves
a value of $-\infty$: since $f^*(\alpha) < \delta$ there exists a $z$ for which

\[
f(z) - \alpha z - \delta < 0 \tag{3.13}
\]

If we take $t \to \infty$ and this $z$ we get a value of $-\infty$. ∎

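In order to complete the derivation of the dual formulation, we should compute the
conjugate dual $f^*$. The following lemma gives the desired result.

Lemma 6 The conjugate dual of $f$ is

\[
f^*(\alpha) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\mathrm{erf}_{inv}^2(\alpha)}{2}\right)
\]

Proof: Recall that:

\[
f(z) = z\,\mathrm{erf}(z) + \frac{1}{\sqrt{2\pi}} e^{-\frac{z^2}{2}} \tag{3.14}
\]

and that its first derivative is

\[
\frac{df}{dz} = \mathrm{erf}(z)
\]

(see Equation 2.17). By Theorem 3.1, $f$ is convex, thus we compute $f$'s conjugate dual:

\[
f^*(\alpha) = \min_z f(z) - \alpha z \tag{3.15}
\]

The minimum satisfies:

\[
\begin{aligned}
f'(z) &= \alpha \\
\mathrm{erf}(z) &= \alpha \\
z &= \mathrm{erf}_{inv}(\alpha)
\end{aligned}
\]

where $\mathrm{erf}_{inv}$ is the inverse function of erf. We plug this equality into the objective and
conclude

\[
\begin{aligned}
f^*(\alpha) &= f(\mathrm{erf}_{inv}(\alpha)) - \alpha\,\mathrm{erf}_{inv}(\alpha) \\
&= \mathrm{erf}_{inv}(\alpha)\,\alpha + \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\mathrm{erf}_{inv}(\alpha)^2}{2}\right) - \alpha\,\mathrm{erf}_{inv}(\alpha) \\
&= \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\mathrm{erf}_{inv}(\alpha)^2}{2}\right)
\end{aligned}
\]

It can be easily verified that $f^*$ is concave, as expected from the theory. Note that from the
derivation above it follows that $\alpha_m \le 1$. ∎

Taking the dual problem in Equation 3.8 and plugging in the conjugate duals derived
above, we get:

\[
\begin{aligned}
\max\quad & \sum_m \alpha_m \\
\text{s.t.}\quad & \Big\|\sum_m \alpha_m y^m x^m\Big\| \le \sigma \sum_m \delta_m \\
& \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\mathrm{erf}_{inv}^2(\alpha_m)}{2}\right) \ge \delta_m \\
& \alpha \ge 0
\end{aligned} \tag{3.16}
\]

Consider the following problem:

\[
\begin{aligned}
\max\quad & \sum_m \alpha_m \\
\text{s.t.}\quad & \Big\|\sum_m \alpha_m y^m x^m\Big\| \le \sigma \sum_m \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\mathrm{erf}_{inv}^2(\alpha_m)}{2}\right) \\
& \alpha \ge 0
\end{aligned} \tag{3.17}
\]

The following proposition asserts that both of the formulations above are equivalent.

Proposition 7 Equation 3.16 and Equation 3.17 are equivalent.

Proof: Denote by $C_1$ the feasible region of Equation 3.16 and by $C_2$ the feasible region of
Equation 3.17. Let $(\alpha, \delta) \in C_1$. Then trivially we have $\alpha \in C_2$. On the other hand, let $\alpha \in C_2$.
Denote $\delta_m = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}\mathrm{erf}_{inv}(\alpha_m)^2\right)$. It is easy to verify that this selection corresponds
to a feasible point for Equation 3.16 (i.e. $(\alpha, \delta) \in C_1$) with the same objective value. ∎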
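
The closed form of Lemma 6 and the norm constraint of Equation 3.17 lend themselves to a quick numerical check. The sketch below assumes that erf is the standard normal CDF, so that $\mathrm{erf}_{inv}$ is `NormalDist().inv_cdf` from the Python standard library; the conjugate is approximated by a brute-force grid search, and the helper names are illustrative only.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # zero mean, unit variance


def f(z):
    """f(z) = z * erf(z) + exp(-z^2 / 2) / sqrt(2 * pi), with erf as normal CDF."""
    return z * STD_NORMAL.cdf(z) + STD_NORMAL.pdf(z)


def f_conjugate_closed_form(alpha):
    """Closed form of Lemma 6; requires 0 < alpha < 1."""
    return STD_NORMAL.pdf(STD_NORMAL.inv_cdf(alpha))


def f_conjugate_numeric(alpha, lo=-8.0, hi=8.0, n=16001):
    """Grid approximation of min_z f(z) - alpha * z (Equation 3.15)."""
    step = (hi - lo) / (n - 1)
    return min(f(lo + k * step) - alpha * (lo + k * step) for k in range(n))


def dual_is_feasible(alphas, ys, xs, sigma):
    """Check the norm constraint of Equation 3.17 for alphas in (0, 1)."""
    dim = len(xs[0])
    combo = [sum(a * y * x[d] for a, y, x in zip(alphas, ys, xs)) for d in range(dim)]
    lhs = math.sqrt(sum(c * c for c in combo))
    rhs = sigma * sum(f_conjugate_closed_form(a) for a in alphas)
    return lhs <= rhs


if __name__ == "__main__":
    for alpha in (0.1, 0.5, 0.9):
        print(alpha, f_conjugate_closed_form(alpha), f_conjugate_numeric(alpha))
```

The feasibility helper also illustrates the remark that increasing $\sigma$ expands the feasible region: the right-hand side grows linearly in $\sigma$ while the left-hand side does not depend on it.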



As we have seen, the optimization problem we analyze in this work is a relative of 
the SVM problem. It is interesting to examine what happens when considering the duals. 
Consider the SVM formulation 

A||„.,||2 1 S^ M 



mm,,, 



Z^m=l Sri 



Its dual is 



s.t. Vm G {1,2,...,M} : f m > 1 -y m w T x m (3.18) 

min « Em=l «m " I Em,n=l U m a n y m y n ( X m ) T X n 

s.t. Vm G {1,2,...,M} : < a m < { v ' ' 

This dual form shares some properties with the dual form of GURU. For example, notice
that in both cases one tries to maximize the sum of the dual variables $\alpha_m$. Another issue
is that of the norm minimization. The SVM dual explicitly minimizes the norm of the
classifier. In our dual, however, the situation is rather implicit: there exists a bound on the
norm of the classifier. Without going into the details, we mention that moving a constraint
into the objective or vice versa is possible in the context of Lagrangian duality. Finally,
notice that in spite of the fact that $\sigma$ and $\lambda$ play similar roles, increasing $\lambda$ results in shrinking
the feasible region of the SVM dual, whereas in our problem, increasing $\sigma$ expands the
feasible region.

The last issue we discuss in this section is the norm of the optimal classifier. Note that
by solving the dual formulation, one can only get the optimal classifier up to a scaling
factor. Of course, it is essential to know the norm exactly in order to be able to use the
classifier. This goal can be achieved using the following theorem:

Theorem 1.2: The norm of the optimal classifier is

\[
\|w^*\| = \frac{1}{\sigma\,\mathrm{erf}_{inv}(\alpha_m^*) + y^m (\hat{w}^*)^T x^m} \tag{3.20}
\]

for every $m$, where $\hat{w}^*$ is the normalized optimal classifier.



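Proof: The minimization of the Lagrangian that precedes Equation 3.4 may be written as

\[
\begin{aligned}
\min_{w, r, z, \{r_m\}} \mathcal{L}(w, r, z, \alpha, \lambda, \delta)
= {} & \min_{w, r} \sum_m \min_{z_m, r_m} \left[ r_m f\left(\frac{z_m}{r_m}\right) - \alpha_m z_m - \delta_m r_m \right] \\
& + \lambda\left[\sigma\|w\| - r\right] + \sum_m \alpha_m\left[1 - y^m w^T x^m\right] - \mu r + r \sum_m \delta_m \\
= {} & \min_{w, r} \sum_m \min_{r_m} r_m \min_{z_m} \left[ f\left(\frac{z_m}{r_m}\right) - \alpha_m \frac{z_m}{r_m} - \delta_m \right] \\
& + \lambda\left[\sigma\|w\| - r\right] + \sum_m \alpha_m\left[1 - y^m w^T x^m\right] - \mu r + r \sum_m \delta_m
\end{aligned}
\]

since $r_m \ge 0$. We define $q_m = \frac{z_m}{r_m}$. Since the equation above depends on $z_m$ only via $q_m$,
we get

\[
\begin{aligned}
\min_{w, r, z, \{r_m\}} \mathcal{L}(w, r, z, \alpha, \lambda, \delta)
= {} & \min_{w, r} \sum_m \min_{r_m} r_m \min_{q_m} \left[ f(q_m) - \alpha_m q_m - \delta_m \right] \\
& + \lambda\left[\sigma\|w\| - r\right] + \sum_m \alpha_m\left[1 - y^m w^T x^m\right] - \mu r + r \sum_m \delta_m
\end{aligned}
\]

If, when substituting the dual optimum into the Lagrangian, there exists a unique primal
feasible solution, then it must be primal optimal (see Boyd and Vandenberghe [2004], 5.5.5
for details). Thus, at the optimum $q_m^* = \arg\min_{q_m} \left[ f(q_m) - \alpha_m q_m \right]$. According to the proof
of Lemma 6 it holds that $q_m^* = \mathrm{erf}_{inv}(\alpha_m)$. By exploiting the monotonicity properties of the
problem (that were presented in the beginning of the section), we conclude that

\[
\frac{1 - y^m w^T x^m}{\sigma\|w\|} = \mathrm{erf}_{inv}(\alpha_m) \tag{3.21}
\]

The desired result follows from basic algebraic operations. ∎

Note that the values of the optimal $\alpha$'s are known, as well as the normalized vector
$\frac{w}{\|w\|}$. Thus, we can compute the optimal norm.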
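
The norm recovery of Theorem 1.2 amounts to a one-line computation. The following sketch assumes, as before, that $\mathrm{erf}_{inv}$ is the inverse of the standard normal CDF; the function name `recover_norm` and the round-trip example are illustrative and not taken from the thesis.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def recover_norm(alpha_m, y_m, x_m, w_hat, sigma):
    """Norm of the classifier from one dual variable (Equation 3.20).

    alpha_m : dual variable of sample m, in (0, 1)
    w_hat   : normalized classifier w / ||w||
    """
    projection = sum(wi * xi for wi, xi in zip(w_hat, x_m))
    return 1.0 / (sigma * STD_NORMAL.inv_cdf(alpha_m) + y_m * projection)


if __name__ == "__main__":
    # Round trip: build alpha_m from a known w via Equation 3.21,
    # then recover ||w|| from alpha_m and the direction of w alone.
    w, sigma = [1.5, -0.5], 0.7
    x, y = [0.2, 0.4], 1
    norm = math.sqrt(sum(wi * wi for wi in w))
    gap = 1.0 - y * sum(wi * xi for wi, xi in zip(w, x))
    alpha = STD_NORMAL.cdf(gap / (sigma * norm))
    w_hat = [wi / norm for wi in w]
    print(norm, recover_norm(alpha, y, x, w_hat, sigma))
```

In exact arithmetic every sample $m$ yields the same value, so in practice one could average the estimates over several samples to reduce the effect of an inexact dual solution; this averaging is a suggestion of the sketch and is not discussed in the text.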



It is possible that the norm of the optimal classifier is bounded (as a function of $\sigma$).
Although we could not prove this result, we conjecture that such a result might stem from a
strong duality argument:

\[
\|\alpha^*\|_1 = \sum_m \ell^{rob}_{hinge}(x^m, y^m; w^*, \sigma^2) \tag{3.22}
\]

Figure 3.1: Norm of the optimal classifiers trained for the toy problems of Section 5 for various $\sigma$
values. (a)-(b) represent the respective toy problems.

By plugging Equation 3.21 into the previous equality, we obtain

\[
\|w^*\| = \frac{\|\alpha^*\|_1}{\sigma \sum_m f\left(\mathrm{erf}_{inv}(\alpha_m^*)\right)} \tag{3.23}
\]

A better understanding of the constraints on $\alpha$ may help bounding the RHS of the equation.
We have plotted the norm of the optimal classifiers for the toy problems of Chapter 2 (refer
to Section 5 for more details). The results are shown in Figure 3.1 and clearly support this
conjecture.



2. A general framework 

The dual form we have derived sheds some light on the structure of the problem. In this
section we discuss the relation between the loss function $f$ and the norm constraint that
appears in the dual. We claim that there is a correspondence between approximations of $f$
and relaxations of the dual problem. More specifically, approximations of the loss function
culminate in approximations of the feasible region of the dual problem.

The norm constraint in the dual is a core component of the optimization. We denote by

\[
s(\alpha) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\mathrm{erf}_{inv}^2(\alpha)}{2}\right) \tag{3.24}
\]

the function under summation. It is complicated to handle and understand $s(\alpha)$, thus it is
appealing to approximate it using elementary functions. Two such approximations are

\[
\begin{aligned}
s_1(\alpha) &= H_2(\alpha) = -\alpha \log_2(\alpha) - (1 - \alpha)\log_2(1 - \alpha) \\
s_2(\alpha) &= 4\alpha(1 - \alpha)
\end{aligned} \tag{3.25}
\]

(see Figure 3.2).

Figure 3.2: The dual constraint may be approximated using elementary functions.

Note that in the previous section we only used $f$ as a means to express $g^*$ (Equation
3.9). Thus, if one replaces $f$ with some alternative convex loss function $\tilde{f}$, the derivation
of the dual will remain correct. Of course, the dual norm constraint will be affected by this
change.

In order to understand the nature of the approximations in Equation 3.25 it is necessary to
explore the respective dual conjugates.

Lemma 8 Let $f_2(z) = \log_2(1 + 2^z)$. Then its conjugate dual is

\[
f_2^*(\alpha) = -\alpha \log_2(\alpha) - (1 - \alpha)\log_2(1 - \alpha) \tag{3.26}
\]

Proof: We compute $f_2$'s conjugate dual:

\[
f_2^*(\alpha) = \min_z f_2(z) - \alpha z \tag{3.27}
\]

The minimum satisfies:

\[
\begin{aligned}
\frac{2^z}{1 + 2^z} &= \alpha \\
2^z &= \frac{\alpha}{1 - \alpha} \\
z &= \log_2 \frac{\alpha}{1 - \alpha}
\end{aligned}
\]

We plug this equality into the objective and conclude

\[
\begin{aligned}
f_2^*(\alpha) &= \log_2\left(1 + \frac{\alpha}{1 - \alpha}\right) - \alpha \log_2 \frac{\alpha}{1 - \alpha} \\
&= -\alpha \log_2(\alpha) - (1 - \alpha)\log_2(1 - \alpha)
\end{aligned}
\]

as claimed. As in the case of our Gaussian robust loss, we have $\alpha \le 1$. ∎

Figure 3.3: Log loss appears naturally in our framework. In addition, we have demonstrated a means
to generate some other loss functions, such as the quadratic loss above.



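Lemma 9 Let

\[
f_3(z) =
\begin{cases}
0 & \text{if } z < -4 \\
\frac{(z + 4)^2}{16} & \text{if } -4 \le z \le 4 \\
z & \text{if } z > 4
\end{cases} \tag{3.28}
\]

Then its conjugate dual is

\[
f_3^*(\alpha) =
\begin{cases}
4\alpha(1 - \alpha) & \text{if } 0 \le \alpha \le 1 \\
-\infty & \text{if } \alpha > 1
\end{cases} \tag{3.29}
\]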



Proof: It is easy to verify that $f_3$ is smooth. We thus compute $f_3$'s conjugate dual in the
following way:

\[
f_3^*(\alpha) = \min_z f_3(z) - \alpha z \tag{3.30}
\]

Extremum points satisfy:

\[
f_3'(z) - \alpha =
\begin{cases}
-\alpha & \text{if } z < -4 \\
\frac{z + 4}{8} - \alpha & \text{if } -4 \le z \le 4 \\
1 - \alpha & \text{if } z > 4
\end{cases}
= 0
\]

The above equation vanishes at $z = 8\alpha - 4$. For $0 \le \alpha \le 1$ we have $-4 \le z \le 4$, thus we
conclude

\[
f_3^*(\alpha) = 4\alpha(1 - \alpha), \qquad 0 \le \alpha \le 1
\]

For $\alpha > 1$ we take $z \to \infty$, and for $z > 4$ we have $f_3(z) - \alpha z = (1 - \alpha)z \to -\infty$. Altogether we have
established the desired result. ∎
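
Lemma 8 and Lemma 9 can both be verified numerically by minimizing $f(z) - \alpha z$ over a fine grid. The following is a small sketch of such a check; the grid range and resolution are arbitrary choices of the sketch, and the function names are illustrative.

```python
import math


def f2(z):
    """Logarithmic loss of Lemma 8."""
    return math.log2(1.0 + 2.0 ** z)


def f3(z):
    """Piecewise quadratic loss of Lemma 9."""
    if z < -4.0:
        return 0.0
    if z <= 4.0:
        return (z + 4.0) ** 2 / 16.0
    return z


def conjugate_numeric(func, alpha, lo=-40.0, hi=40.0, n=80001):
    """Grid approximation of min_z func(z) - alpha * z."""
    step = (hi - lo) / (n - 1)
    return min(func(lo + k * step) - alpha * (lo + k * step) for k in range(n))


def binary_entropy(alpha):
    """H_2(alpha), the closed form of Equation 3.26 for 0 < alpha < 1."""
    return -alpha * math.log2(alpha) - (1.0 - alpha) * math.log2(1.0 - alpha)


def quadratic_conjugate(alpha):
    """4 * alpha * (1 - alpha), the closed form of Equation 3.29 on [0, 1]."""
    return 4.0 * alpha * (1.0 - alpha)


if __name__ == "__main__":
    for alpha in (0.1, 0.5, 0.8):
        print(alpha,
              conjugate_numeric(f2, alpha), binary_entropy(alpha),
              conjugate_numeric(f3, alpha), quadratic_conjugate(alpha))
```

For $\alpha > 1$ the grid minimum keeps decreasing as the upper end of the grid is extended, which reflects the $-\infty$ branch of Equation 3.29.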



These lemmas shed some light on $\ell^{rob}_{hinge}$ and on the structure of our problem. It turns
out that the well-known log-loss as well as a quadratic loss that has the same flavour as the
Huber loss appear naturally in our framework (see Figure 3.3 for a visualization). What
we have demonstrated is that there exists a close connection between approximations of the
primal loss and relaxations of the dual problem. Specifically, we have that the dual of

\[
\min_w \sum_{m=1}^{M} \sigma\|w\|\, \tilde{f}\left(\frac{1 - y^m w^T x^m}{\sigma\|w\|}\right) \tag{3.31}
\]

is

\[
\begin{aligned}
\max\quad & \sum_m \alpha_m \\
\text{s.t.}\quad & \Big\|\sum_m \alpha_m y^m x^m\Big\| \le \sigma \sum_m \tilde{f}^*(\alpha_m) \\
& \alpha \ge 0
\end{aligned} \tag{3.32}
\]

Note, however, that this connection should be further investigated. It should be observed
that not every smooth convex primal loss $f$ yields a perspective that is convex in
$w$. For that to happen, $f$ should satisfy some mathematical properties that are yet to be
understood. One example for such a condition is $f(z) \ge z \frac{d}{dz} f(z)$. Under this condition we
can use the same reasoning as in the proof of Theorem 3.2 and conclude that the primal
problem is convex. In this case, we can automatically apply the derivation presented in the
previous section and deduce the respective dual problem. Another issue that should be
better understood is the connection between approximations of $f$ and the robust setup we
have begun with. In particular, it is interesting to understand if the logarithmic loss may be
interpreted as resulting from RO.

Chapter 4

Introducing Kernels

One of the greatest strengths of the theory of support vector machines is the simple
generalization to nonlinear cases. This generalization is carried out via the elegant notion of
kernels. An examination of our derivation suggests that one may apply the kernel trick and
introduce a means to learn nonlinear classifiers in the Gaussian Robust framework.

In this chapter we will develop a kernelized version of the GURU algorithm. Most of
the derivation is straightforward: we begin by giving a representer result. Plugging the new
parametrization of the classifier into the framework, we show that our update formulas are
perfectly suitable for maintaining this kind of representation. The tricky part stems from the
fact that our updates depend directly on the norm of the weights vector. Naive computation
of the norm costs $O(M^2)$ operations, which significantly slows down the algorithm. We
thus derive a procedure to update the norm in $O(1)$, based on previous computations.

1. A representer result 

The first step towards kernelization of GURU is to change our representation of the classifier
from a weights vector ($w$) to a linear combination of the training samples. The theoretical
justification of such an operation is known as a representer result.

The fact that an optimal classifier may be represented as a linear combination of the
training samples stems from the mathematical theory of Hilbert spaces. In our case, as well
as in SVM, however, the same result can be derived using far simpler and more explicit
argumentation. In this section we will show three ways to establish the representer result
for the case of GURU. In spite of the fact that we could prove the theorem using abstract
argumentation, it is necessary to develop the technical proof, as it lays the foundations for
the derivation of the kernelized algorithm.

We start by stating a version of the representer theorem:

Theorem 1.1: Let $\mathcal{H}$ be a reproducing kernel Hilbert space with a kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$,
a symmetric positive semi-definite function on the compact domain. For any function $L :
\mathbb{R}^n \to \mathbb{R}$, and any nondecreasing function $\Omega : \mathbb{R} \to \mathbb{R}$, if

\[
J^* = \min_{f \in \mathcal{H}} J(f) = \min_{f \in \mathcal{H}} \left\{ \Omega\left(\|f\|_{\mathcal{H}}^2\right) + L\left(f(x_1), f(x_2), \dots, f(x_n)\right) \right\}
\]

is well-defined, then there are some $\alpha_1, \alpha_2, \dots, \alpha_n \in \mathbb{R}$, such that

\[
f(\cdot) = \sum_{i=1}^{n} \alpha_i k(x_i, \cdot) \tag{4.1}
\]

achieves $J(f) = J^*$. Furthermore, if $\Omega$ is increasing, then each minimizer of $J(f)$ can be
expressed in the form of Equation 4.1.

For a proof and more details, see for example Schölkopf and Smola [2002].

As mentioned, we will discuss three techniques to establish the required result. First,
using the structure of the updates that GURU performs. Second, by the derivation of the
dual problem presented in Chapter 3, and third, using the general representer theorem.



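Theorem 1.2: There exists a solution of Equation 2.22 that takes the form

\[
w = \sum_{m=1}^{M} \alpha_m y^m x^m \tag{4.2}
\]

Proof: Via the structure of GURU

Recall that the updates in the GURU algorithm are of the form

\[
w \leftarrow w - \frac{\eta_0}{\sqrt{t}} \left( -y^i x^i\,\mathrm{erf}\left(\frac{1 - y^i w^T x^i}{\sigma\|w\|}\right)
+ \frac{\sigma w}{\sqrt{2\pi}\,\|w\|} \exp\left(-\frac{(1 - y^i w^T x^i)^2}{2\sigma^2\|w\|^2}\right) \right)
\]

It is suggestive to observe that the update formula can be split and written as two successive
steps. The first of which is

\[
w \leftarrow w - \frac{\eta_0}{\sqrt{t}} \frac{\sigma w}{\sqrt{2\pi}\,\|w\|} \exp\left(-\frac{(1 - y^i w^T x^i)^2}{2\sigma^2\|w\|^2}\right)
\]

followed by

\[
w \leftarrow w + \frac{\eta_0}{\sqrt{t}}\, y^i x^i\,\mathrm{erf}\left(\frac{1 - y^i w^T x^i}{\sigma\|w\|}\right) \tag{4.3}
\]

The first step is nothing else than a rescaling of the weights vector

\[
w \leftarrow \gamma w, \qquad \gamma = 1 - \frac{\eta_0}{\sqrt{t}} \frac{\sigma}{\sqrt{2\pi}\,\|w\|} \exp\left(-\frac{(1 - y^i w^T x^i)^2}{2\sigma^2\|w\|^2}\right) \tag{4.4}
\]

Recall that GURU initializes the weight vector as $w = 0$, which clearly can be represented
as

\[
0 = \sum_{m=1}^{M} 0 \cdot y^m x^m \tag{4.5}
\]

We thus assume that the desired representation exists, and proceed by induction. By plugging
the representation into the previous equations, we get

\[
\sum_{m=1}^{M} \alpha_m^{new} y^m x^m = \gamma \sum_{m=1}^{M} \alpha_m y^m x^m = \sum_{m=1}^{M} \gamma \alpha_m y^m x^m
\]

i.e. for all $m$

\[
\alpha_m^{new} = \gamma \alpha_m \tag{4.6}
\]

where $\alpha_m^{new}$ is the result of the respective update. The second step in the update formula
(Equation 4.3) may be written as

\[
\sum_{m=1}^{M} \alpha_m^{new} y^m x^m = \sum_{m=1}^{M} \alpha_m y^m x^m + \frac{\eta_0}{\sqrt{t}}\, y^i x^i\,\mathrm{erf}\left(\frac{1 - y^i w^T x^i}{\sigma\|w\|}\right)
\]

i.e.

\[
\alpha_m^{new} =
\begin{cases}
\alpha_m & m \ne i \\
\alpha_i + \mu & m = i
\end{cases} \tag{4.7}
\]

where $\mu = \frac{\eta_0}{\sqrt{t}}\,\mathrm{erf}\left(\frac{1 - y^i w^T x^i}{\sigma\|w\|}\right)$. Combining both steps, we end up with the following update rule:

\[
\alpha_m^{t+1} =
\begin{cases}
\gamma \alpha_m^t & m \ne i \\
\gamma \alpha_i^t + \mu & m = i
\end{cases} \tag{4.8}
\]

Since GURU is guaranteed to converge to the optimum, by taking $t \to \infty$ we establish the
desired result. ∎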

Proof: Via the dual formulation

We have already seen (Equation 3.4) that

\[
\sigma\lambda \frac{w}{\|w\|} = \sum_m \alpha_m y^m x^m
\]

By defining $\tilde{\alpha}_m = \frac{\|w\|}{\sigma\lambda} \alpha_m$ and plugging it into the previous equality, we conclude that

\[
w = \sum_m \tilde{\alpha}_m y^m x^m
\]

as required. ∎

Proof: Via the general representer theorem

Set $\Omega \equiv 0$, $L\left(f(x_1), f(x_2), \dots, f(x_n)\right) = \sum_{i=1}^{n} \ell\left(f(x_i)\right)$, $\ell = \ell^{rob}_{hinge}$, and let $k$ be the linear
kernel $k(x_1, x_2) = x_1^T x_2$. The desired result stems immediately from Theorem 1.1. ∎

2. KEN-GURU: A primal kernelized version of GURU

In the previous section we have established a representer result for GURU. The next step in
the derivation is to rework the components of the algorithm, so that the only dependence on the
data samples and on the classifier is via dot products. That being the case, we can
apply the kernel trick, namely replace each dot product $(x^m)^T x^n$ with the kernel
entry $K(x^m, x^n)$ (for details see, for example, Aizerman et al. [1964]; Schölkopf and Smola
[2002]). We start by expanding the quantities that appear in the update formula in terms of
the $\alpha_m$'s. Then, we introduce a method to update the value of the norm variable in a
computationally cheap way. We conclude the section by putting the results together, and presenting
the KEN-GURU (KErNelized GaUssian RobUst) algorithm.



In order to compute $\gamma$ and $\mu$ of Equation 4.4 and Equation 4.7, one must know the
values of $w^T x^i$ and $\|w\|$. Let us expand the first quantity

\[
\begin{aligned}
w^T x^i &= \left(\sum_{m=1}^{M} \alpha_m y^m x^m\right)^T x^i \\
&= \sum_{m=1}^{M} \alpha_m y^m (x^m)^T x^i \\
&= \sum_{m=1}^{M} \alpha_m y^m K_{mi}
\end{aligned}
\]

The norm might be computed as

\[
\begin{aligned}
\|w\|^2 = w^T w &= \left(\sum_{m=1}^{M} \alpha_m y^m x^m\right)^T \sum_{n=1}^{M} \alpha_n y^n x^n \\
&= \sum_{m=1}^{M} \sum_{n=1}^{M} \alpha_m \alpha_n y^m y^n (x^m)^T x^n \\
&= \sum_{m=1}^{M} \sum_{n=1}^{M} \alpha_m \alpha_n y^m y^n K_{mn}
\end{aligned}
\]

Note that the Gram matrix $K$ may be precomputed and cached (total cost of $O(M^2)$).
Thus, $w^T x^i$ can be computed in $O(M)$, and $\|w\|$ in $O(M^2)$. As both of these values should
be computed for each update, the cost of the norm computation is extremely expensive.
Instead of computing the norm each time from scratch, it is possible to use its previous
value. The updated norm may be computed as

\[
\begin{aligned}
\|w\|_{t+1}^2 &= \sum_{m=1}^{M} \sum_{n=1}^{M} \alpha_m^{t+1} \alpha_n^{t+1} y^m y^n K_{mn} \\
&= \sum_{m \ne i} \sum_{n \ne i} \alpha_m^{t+1} \alpha_n^{t+1} y^m y^n K_{mn}
+ 2 \sum_{m \ne i} \alpha_m^{t+1} \alpha_i^{t+1} y^m y^i K_{mi}
+ \left(\alpha_i^{t+1}\right)^2 K_{ii}
\end{aligned}
\]

By plugging Equation 4.8 we get

\[
\begin{aligned}
\|w\|_{t+1}^2 &= \gamma^2 \sum_{m=1}^{M} \sum_{n=1}^{M} \alpha_m^t \alpha_n^t y^m y^n K_{mn}
+ 2\gamma\mu \sum_{m=1}^{M} \alpha_m^t y^m y^i K_{mi} + \mu^2 K_{ii} \\
&= \gamma^2 \|w\|_t^2 + 2\gamma\mu\, y^i \sum_{m=1}^{M} \alpha_m^t y^m K_{mi} + \mu^2 K_{ii} \\
&= \gamma^2 \|w\|_t^2 + 2\gamma\mu\, y^i w^T x^i + \mu^2 K_{ii}
\end{aligned}
\]

where $w^T x^i$ is computed regardless of $\|w\|^2$. Thus, the value of the norm can be
maintained in $O(1)$.
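
The constant-time norm update can be checked directly in the linear case, where the update of Equation 4.8 corresponds to $w \leftarrow \gamma w + \mu y^i x^i$. The sketch below compares the incremental formula with a from-scratch computation; the function name `updated_squared_norm` is illustrative.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def updated_squared_norm(sq_norm, w_dot_x, x_dot_x, y, gamma, mu):
    """O(1) update of ||w||^2 after w <- gamma * w + mu * y * x.

    sq_norm : current ||w||^2
    w_dot_x : current w^T x^i (needed for the update anyway)
    x_dot_x : Gram entry K_ii
    """
    return gamma * gamma * sq_norm + 2.0 * gamma * mu * y * w_dot_x + mu * mu * x_dot_x


if __name__ == "__main__":
    rng = random.Random(0)
    w = [rng.uniform(-1.0, 1.0) for _ in range(5)]
    x = [rng.uniform(-1.0, 1.0) for _ in range(5)]
    y, gamma, mu = -1, 0.93, 0.21
    w_new = [gamma * wi + mu * y * xi for wi, xi in zip(w, x)]
    print(dot(w_new, w_new),
          updated_squared_norm(dot(w, w), dot(w, x), dot(x, x), y, gamma, mu))
```

Only scalars enter the update, so the cost per step is independent of both the number of samples and the dimension of the feature space.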

It may be easily observed that the data samples $x^m$ participate in the computations of
the update only via the Gram matrix $K$. Thus, we can apply the kernel trick, and use

\[
K_{mn} = k(x^m, x^n) \tag{4.9}
\]

for any Mercer kernel $k$. Based on the results established in the previous sections, we may
translate GURU into a kernelized version, named KEN-GURU.

We introduce an auxiliary variable $\zeta$ that holds the value of the product $k(w, x^i)$ and is
evaluated by

\[
\zeta_{t+1} = \sum_{m=1}^{M} \alpha^t_m y^m K_{mi} \tag{4.10}
\]



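According to Equation 4.4, Equation 4.7 and Equation 4.9 we introduce the following
update formulas, where $\nu_t$ is the norm variable, $\nu_t = \|w\|_t$:

\[
\gamma_{t+1} = 1 - \frac{\eta \sigma}{\sqrt{t}\, \sqrt{2\pi}\, \nu_t}
               \exp\Big( -\frac{(1 - y^i \zeta_{t+1})^2}{2 \sigma^2 \nu_t^2} \Big) \tag{4.11}
\]
\[
\mu_{t+1} = \frac{\eta}{2 \sqrt{t}}
            \Big[ 1 + \mathrm{erf}\Big( \frac{1 - y^i \zeta_{t+1}}{\sqrt{2}\, \sigma \nu_t} \Big) \Big] \tag{4.12}
\]
\[
\nu_{t+1} = \sqrt{ \gamma_{t+1}^2 \nu_t^2 + 2 \gamma_{t+1} \mu_{t+1}\, y^i \zeta_{t+1} + \mu_{t+1}^2 K_{ii} } \tag{4.13}
\]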



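Algorithm 2: KEN-GURU($k$, $S$, $\eta$, $\epsilon$)
Data: Kernel function $k$, training set $S$, learning rate $\eta$, accuracy $\epsilon$
Result: $\alpha$

    // initializations
    forall m, n = 1..M do
        $K_{mn} \leftarrow k(x^m, x^n)$;
    end
    $\alpha \leftarrow 0$;
    $\nu \leftarrow 0$;
    $t \leftarrow 0$;
    while $\Delta L > \epsilon$ do
        // randomize a sample
        $i \leftarrow$ rand(M);
        // evaluate coefficients
        Compute $\zeta_{t+1}$ (Equation 4.10);
        Compute $\gamma_{t+1}$ (Equation 4.11);
        Compute $\mu_{t+1}$ (Equation 4.12);
        Compute $\nu_{t+1}$ (Equation 4.13);
        // update alphas
        $\alpha_{t+1} \leftarrow \gamma_{t+1} \alpha_t$;
        $\alpha^i_{t+1} \leftarrow \alpha^i_{t+1} + \mu_{t+1}$;
        $t \leftarrow t + 1$;
    end
    return $\alpha$;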



The correctness of the algorithm stems directly from that of GURU. 
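A kernelized update loop of this kind might look as follows in Python. This is a sketch under stated assumptions rather than a reference implementation: the coefficients $\gamma$ and $\mu$ are taken from the closed-form gradient of the Gaussian-smoothed hinge loss, the step size is assumed to decay as $\eta/\sqrt{t}$, a fixed iteration budget replaces the $\Delta L > \epsilon$ stopping test, and the names `ken_guru`, `rbf_kernel` and `decision_value` are hypothetical.

```python
import math
import random

def rbf_kernel(a, b, width=1.0):
    """RBF kernel exp(-width * ||a - b||^2)."""
    return math.exp(-width * sum((p - q) ** 2 for p, q in zip(a, b)))

def ken_guru(X, y, kernel, sigma, eta, n_iter=2000, seed=0):
    """Kernelized SGD on the Gaussian-robust hinge loss; returns (alpha, ||w||^2, K)."""
    rng = random.Random(seed)
    M = len(X)
    K = [[kernel(X[m], X[n]) for n in range(M)] for m in range(M)]  # cached Gram matrix
    alpha, sq_norm = [0.0] * M, 0.0
    for t in range(1, n_iter + 1):
        i = rng.randrange(M)
        step = eta / math.sqrt(t)                       # assumed step-size schedule
        zeta = sum(alpha[m] * y[m] * K[m][i] for m in range(M))   # w^T x^i, O(M)
        margin = 1.0 - y[i] * zeta
        s = sigma * math.sqrt(max(sq_norm, 0.0))        # sigma * ||w||
        if s > 1e-12:
            u = margin / s
            pdf = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
            cdf = 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))
            gamma = 1.0 - step * sigma * sigma * pdf / s
            mu = step * cdf
        else:
            # ||w|| = 0: the smoothed loss reduces to the plain hinge.
            gamma, mu = 1.0, (step if margin > 0.0 else 0.0)
        # O(1) maintenance of the squared norm, followed by the coefficient update.
        sq_norm = gamma ** 2 * sq_norm + 2.0 * gamma * mu * y[i] * zeta + mu ** 2 * K[i][i]
        alpha = [gamma * a for a in alpha]
        alpha[i] += mu
    return alpha, sq_norm, K

def decision_value(alpha, y, X, kernel, x):
    """sum_m alpha_m y^m k(x^m, x); its sign is the predicted label."""
    return sum(a * ym * kernel(xm, x) for a, ym, xm in zip(alpha, y, X))
```

Each iteration costs $O(M)$: the sum defining $\zeta$ dominates, while the norm is carried along in constant time instead of being recomputed from the Gram matrix.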
Name              GURU(%)    SVM(%)
Ionosphere        83.55      81.58
diabetes          68.59      66.67
splice 1 vs. 2    92.28      92.28
USPS 3 vs. 5      97.86      98
USPS 5 vs. 8      98.29      98.71
USPS 7 vs. 9      98.43      97.86

Table 4.1: Results summary for KEN-GURU.



3. Experiments

In this section we present experimental results regarding the performance of KEN-GURU.
We show how $\sigma$ affects the learned classifier and then compare KEN-GURU to SVM on
USPS pairs and on the Ionosphere database (see Table 2.1 for details). For the USPS tasks,
a polynomial kernel of degree 2 was used, and for Ionosphere, an RBF kernel with $\gamma = 1$.
The results are summarized in Table 4.1.

Consider Figure 4.1, in which KEN-GURU classifiers trained for various values of the
parameter $\sigma$ with a polynomial kernel of degree 2 are presented. The toy problem was
synthesized by first generating points uniformly on $[-7.5, 7.5] \times [-7.5, 7.5]$. Points which
fall within the ball of radius 2 around the origin were assigned a positive label. Points
which are more distant from the origin than 3.5 units were taken as negative examples.
Points which fell in between were dropped. Observe that increasing $\sigma$ puts extra emphasis
on the number of samples in each class. Specifically, in the problem at hand, there are
many more points outside the circle than inside. When $\sigma$ is rather small, the training is
'local' in the sense that each sample governs what happens in its immediate environment.
On the contrary, when $\sigma$ is relatively big, the emphasis is on global tendencies.
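The synthesis of the toy problem described above can be sketched in a few lines of Python. The function name `make_radial_toy`, its parameters and the seed are assumptions introduced for illustration; only the square, the two radii and the labeling rule come from the text.

```python
import math
import random

def make_radial_toy(n_points, seed=0, half_width=7.5, r_pos=2.0, r_neg=3.5):
    """Sample points uniformly on the square and label them by distance from the origin."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_points):
        p = (rng.uniform(-half_width, half_width), rng.uniform(-half_width, half_width))
        r = math.hypot(p[0], p[1])
        if r <= r_pos:
            X.append(p)
            y.append(+1)          # inside the ball of radius 2: positive
        elif r > r_neg:
            X.append(p)
            y.append(-1)          # farther than 3.5 from the origin: negative
        # points with r_pos < r <= r_neg are dropped
    return X, y
```

Because the outer region covers a much larger area than the inner disc, the negative class dominates the sample, which is the class imbalance that large values of $\sigma$ emphasize.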

On the Ionosphere database, KEN-GURU performs better than SVM (83.55% versus 81.58%).
This advantage of GURU over SVM is consistent with the results obtained with a linear
kernel, and is explained by the noisy nature of the Ionosphere database. For the USPS pairs,
KEN-GURU's performance is very similar to that of SVM.
[Figure 4.1 panels: $\sigma$=1e-10, $\sigma$=1e-7, $\sigma$=1e-3]

Figure 4.1: KEN-GURU performance on a radial data set. The green and red points indicate data
points that were correctly classified (each color stands for one of the classes). Blue points indicate
misclassification. The parameter $\sigma$ determines how far the effect of each data point reaches.
Note that for small values of $\sigma$, the behavior of the classifier is determined locally by the
samples. For rather big $\sigma$, the effect is global, in the sense that the behavior of the classifier
is determined by close as well as distant data samples.
Chapter 5

The Multiclass Case



In the previous chapters we have developed the binary algorithm GURU and its kernelized
version KEN-GURU. In this chapter we analyze another extension of the algorithm, to the
multiclass case.

The ideas that were presented in Chapter 2 may be generalized to the multiclass case.
To that end, we first generalize the loss function we are working with. This goal is
achieved by solving the generalized problem of the adversarial choice. After establishing
this result we derive the effective robust loss function, and devise an optimization algorithm
for it.

We relax the problem twice in order to solve it. First, we work with the sum-of-hinges
loss function (Weston and Watkins [1999]). In addition, we use a superset of noise
distributions, which contains all covariance matrices with a bounded maximal eigenvalue. By the
end of the chapter we will prove that for the binary case the maximal eigenvalue and trace
constraints give the same result.

The setting we address in the following is of data drawn from $\mathcal{X} = \mathbb{R}^d$, accompanied
by labels drawn from $\mathcal{Y} = \{1, 2, \ldots, C\}$. The learning task is to train the weight vectors
$w_1, w_2, \ldots, w_C$. The target classifier is $\phi : \mathcal{X} \to \mathcal{Y}$, defined by

\[
\phi(x; w_1, w_2, \ldots, w_C) = \arg\max_{y} \; w_y^T x \tag{5.1}
\]



1. Problem formulation

In this section we formally describe the generalization of the learning task from the binary
to the multiclass case. We show that the generalization culminates in a loss function which
is the sum of several appropriate binary losses.

In Chapter 2 we started our derivation from the hinge loss

\[
\ell_{hinge}(x^m, y^m; w) = [1 - y^m w^T x^m]_+ \tag{5.2}
\]

The most common generalization of the hinge loss to the multiclass case is

\[
\ell_{mult}(x^m, y^m; w_1, \ldots, w_C) = \max_{y} \big[ w_y^T x^m - w_{y^m}^T x^m + \delta_{y, y^m} \big] \tag{5.3}
\]

However, this loss function is not applicable in our framework (see Appendix C). Instead,
we suggest minimizing the following loss function (Weston and Watkins [1999]):

\[
\ell_{sum}(x^m, y^m; w_1, w_2, \ldots, w_C) = \sum_{y' \neq y^m} \big[ 1 - (w_{y^m} - w_{y'})^T x^m \big]_+ \tag{5.4}
\]

which is a surrogate to the zero-one loss.
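To make the difference between the two multiclass losses concrete, the following Python sketch evaluates both on a single sample. It is an illustration under assumptions: labels are 0-indexed, the function names are hypothetical, and the multi-hinge loss is written in its common margin form, in which a unit margin is required against every wrong class.

```python
def class_scores(W, x):
    """Scores w_y^T x for every class y; W is a list of weight vectors."""
    return [sum(wj * xj for wj, xj in zip(w, x)) for w in W]

def sum_of_hinges(W, x, label):
    """Sum over wrong classes y' of [1 - (w_label - w_y')^T x]_+ (Equation 5.4)."""
    s = class_scores(W, x)
    return sum(max(0.0, 1.0 - (s[label] - s[c])) for c in range(len(W)) if c != label)

def multi_hinge(W, x, label):
    """Largest margin violation over the wrong classes, clipped at zero (assumed convention)."""
    s = class_scores(W, x)
    return max(0.0, max(1.0 - (s[label] - s[c]) for c in range(len(W)) if c != label))

def zero_one(W, x, label):
    """1 if the arg-max classifier of Equation 5.1 errs on (x, label), else 0."""
    s = class_scores(W, x)
    return 0.0 if max(range(len(W)), key=lambda c: s[c]) == label else 1.0
```

On any sample the three values are ordered as zero-one, then multi-hinge, then sum-of-hinges, so the sum-of-hinges loss is the looser of the two surrogates.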

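Let us write down the formulation of the problem in this case:

\[
\min_{w_1, \ldots, w_C} \sum_{m} \max_{\Sigma \in \Gamma_\beta} E_{n \sim \mathcal{N}(0, \Sigma)}
\sum_{y' \neq y^m} \big[ 1 - (w_{y^m} - w_{y'})^T (x^m + n) \big]_+ \tag{5.5}
\]

where

\[
\Gamma_\beta = \{ \Sigma \in PSD : \rho(\Sigma) \le \beta \} \tag{5.6}
\]

and $\rho$ is the spectral norm of a matrix, defined by

\[
\rho(A) = \sqrt{\lambda_{max}(A^* A)}
\]

Using this set we constrain the maximal power of noise that the adversary may spread in
each primary direction.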

2. The adversarial choice

In the following we focus on deriving the adversarial choice for the problem at hand.
It appears that in the current setup, the solution is simpler than the one we had in Chapter 2.

2.1 Applying a spectral norm constraint

Let us investigate the adversary's optimal way of spreading the noise. The ideas
of the development are similar to those of Theorem 2.1. Denote

\[
\Delta W_{y, y'} = w_y - w_{y'} \tag{5.7}
\]

Using the same procedure we employed in the binary case (see Chapter 2 and
Equation 2.15 therein) we can write Equation 5.5 as:

\[
\min_{w_1, \ldots, w_C} \sum_{m} \max_{\Sigma \in \Gamma_\beta} \sum_{y' \neq y^m}
L\big( x^m, +1; \Delta W_{y^m, y'}, \Delta W_{y^m, y'}^T \Sigma\, \Delta W_{y^m, y'} \big) \tag{5.8}
\]

i.e. the task at hand is to optimize the effective loss function

\[
\ell^{rob}_{sum}(x^m, y^m; w_1, w_2, \ldots, w_C, \beta) = \max_{\Sigma \in \Gamma_\beta} \sum_{y' \neq y^m}
L\big( x^m, +1; \Delta W_{y^m, y'}, \Delta W_{y^m, y'}^T \Sigma\, \Delta W_{y^m, y'} \big) \tag{5.9}
\]

Observe that in every appearance of $\ell^{rob}_{hinge}$, the label $y^m$ was replaced with $+1$. The reason
for this change is that we are classifying using the weight vector $w_{y^m} - w_{y'}$. That is, our
prediction is

\[
(w_{y^m} - w_{y'})^T x^m = w_{y^m}^T x^m - w_{y'}^T x^m
\]

Our objective is, of course, to have $w_{y^m}^T x^m > w_{y'}^T x^m$, which corresponds to the label $+1$.
The next theorem specifies the adversarial choice of the covariance matrix $\Sigma$, and is the
multiclass analog of Theorem 2.1.

Theorem 2.1: The optimal $\Sigma$ in Equation 5.9 is given by $\Sigma^* = \beta I$.



Proof: In Lemma 2 we have shown that $L$ is monotone increasing in its 4th argument.
By the Cauchy-Schwarz inequality we have that

\[
\Delta W_{y^m, y'}^T \Sigma\, \Delta W_{y^m, y'} \le \beta \|\Delta W_{y^m, y'}\|^2 \tag{5.10}
\]

On the other hand, it holds that for all $y'$

\[
\Delta W_{y^m, y'}^T (\beta I)\, \Delta W_{y^m, y'} = \beta \|\Delta W_{y^m, y'}\|^2 \tag{5.11}
\]

hence this upper bound is attained for all $C - 1$ summands concurrently with $\Sigma = \beta I$.
The geometric interpretation of this result is that under the spectral norm constraint, the
adversary will choose to spread the noise in an isotropic fashion around the sample point.
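The bound used in the proof can be illustrated numerically in two dimensions. The sketch below is not part of the proof: it parameterizes a $2 \times 2$ PSD matrix by a rotation angle and two eigenvalues (a hypothetical parameterization chosen for brevity) and samples matrices with eigenvalues in $[0, \beta]$.

```python
import math
import random

def quad_form(w, theta, lam1, lam2):
    """w^T Sigma w for the 2x2 matrix Sigma = R(theta) diag(lam1, lam2) R(theta)^T."""
    c, s = math.cos(theta), math.sin(theta)
    a = c * w[0] + s * w[1]       # component of w along the first eigenvector
    b = -s * w[0] + c * w[1]      # component of w along the second eigenvector
    return lam1 * a * a + lam2 * b * b

def max_over_random_spectral_ball(w, beta, n_trials=10000, seed=0):
    """Largest w^T Sigma w found among random PSD matrices with eigenvalues in [0, beta]."""
    rng = random.Random(seed)
    return max(quad_form(w, rng.uniform(0.0, math.pi),
                         rng.uniform(0.0, beta), rng.uniform(0.0, beta))
               for _ in range(n_trials))
```

No sampled matrix exceeds $\beta \|w\|^2$, while the isotropic choice `quad_form(w, theta, beta, beta)` attains it for every angle, in agreement with Equations 5.10 and 5.11.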



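We thus get the following optimization problem:

\[
\min_{w_1, \ldots, w_C} \sum_{m} \sum_{y' \neq y^m}
L\big( x^m, +1; \Delta W_{y^m, y'}, \beta \|\Delta W_{y^m, y'}\|^2 \big) \tag{5.12}
\]

Applying the same terminology used in the binary case, we have:

\[
\min_{w_1, \ldots, w_C} \sum_{m} \sum_{y' \neq y^m} \ell^{rob}_{hinge}(x^m, +1; \Delta W_{y^m, y'}, \beta) \tag{5.13}
\]

and Equation 5.9 equals

\[
\ell^{rob}_{sum}(x^m, y^m; w_1, w_2, \ldots, w_C, \beta) = \sum_{y' \neq y^m} \ell^{rob}_{hinge}(x^m, +1; \Delta W_{y^m, y'}, \beta) \tag{5.14}
\]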



2.2 The connection to the trace constraint

It is interesting to examine the reduction of the multiclass loss we have derived to the
binary case. Note that since we have used a substantially larger collection of matrices, there is
no a priori reason to expect that the results will coincide.

Taking $C = 2$ brings us back to the binary case. We use $w_{+1}, w_{-1}$ for the weight
vectors of the classes. By expanding Equation 5.9 we get

\[
\ell^{rob}_{sum}(x^m, y^m; w_{+1}, w_{-1}, \beta) = \ell^{rob}_{hinge}(x^m, +1; w_{y^m} - w_{-y^m}, \beta) \tag{5.15}
\]

If we take $w = w_{+1} - w_{-1}$, we end up with

\[
\ell^{rob}_{sum}(x^m, y^m; w_{+1}, w_{-1}, \beta) = \ell^{rob}_{hinge}(x^m, y^m; w, \beta) \tag{5.16}
\]

It is interesting to observe that the resulting loss functions are identical, even though
the constraints we put on the covariance matrices are different. In order to explain this
phenomenon, let us go back to the geometric intuition that we have given prior to the proof of
Theorem 2.1.

Figure 5.1: Visualization of $A_1$ and $\Gamma_1$ in the 2-dimensional case. The axes represent the
eigenvalues of $\Sigma$. The dark shaded region contains all the matrices having $\lambda_1 + \lambda_2 \le 1$,
i.e. corresponds to $A_1$. The light area corresponds to $\Gamma_1$, and consists of all the matrices with
$\max\{\lambda_1, \lambda_2\} \le 1$.

Consider Figure 5.1, which presents a visualization of $A_\beta$ and $\Gamma_\beta$ in the 2-dimensional
case. What we have shown in Theorem 2.1 is that the multiclass adversary will choose the
point (1, 1). Under the trace constraint, however, the adversary will have to choose either
(1, 0), (0, 1), or any other point lying on the line connecting them. Our geometric intuition
says that all the power that is not spread perpendicularly to the separating hyperplane
is irrelevant. Thus, when the adversary has to choose a directional noise, he takes
the perpendicular direction. On the other hand, if we limit his action axis-wise (and not
overall), he will surely choose to spread the noise equally over all of the axes.
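The coincidence of the two constraints in the binary case can also be seen numerically. The following two-dimensional Python sketch is illustrative only: the rotation-plus-eigenvalues parameterization and the function names are assumptions, and the trace-constrained optimum is taken to be the rank-one matrix that puts all the power along $w$, as the geometric argument above suggests.

```python
import math
import random

def quad_form(w, theta, lam1, lam2):
    """w^T Sigma w for the 2x2 matrix Sigma = R(theta) diag(lam1, lam2) R(theta)^T."""
    c, s = math.cos(theta), math.sin(theta)
    a = c * w[0] + s * w[1]
    b = -s * w[0] + c * w[1]
    return lam1 * a * a + lam2 * b * b

def spectral_optimum(w, beta):
    """Adversary under max-eigenvalue <= beta: Sigma = beta * I."""
    return quad_form(w, 0.0, beta, beta)

def trace_optimum(w, beta):
    """Adversary under trace <= beta: all the power along w (rank-one Sigma)."""
    theta = math.atan2(w[1], w[0])    # first eigenvector aligned with w
    return quad_form(w, theta, beta, 0.0)

def max_over_random_trace_ball(w, beta, n_trials=10000, seed=0):
    """Largest w^T Sigma w among random PSD matrices with trace at most beta."""
    rng = random.Random(seed)
    best = 0.0
    for _ in range(n_trials):
        total = beta * rng.random()
        lam1 = total * rng.random()
        best = max(best, quad_form(w, rng.uniform(0.0, math.pi), lam1, total - lam1))
    return best
```

Both optima equal $\beta \|w\|^2$, so the variance seen by the binary loss is the same under either constraint, even though the maximizing matrices differ.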
3. M-GURU: a primal algorithm for the multiclass case

In the following we generalize GURU (that was presented in Chapter 4) to the multiclass
case. As a direct corollary of the results presented in previous chapters, we have that our
loss function in this case is strictly convex. Thus, we turn to devise an SGD procedure.

We shall begin by computing the gradient of $\ell^{rob}_{sum}(x^m, y^m; w_1, w_2, \ldots, w_C, \beta)$. For
convenience, we write it in terms of the binary loss function $\ell^{rob}_{hinge}$:

\[
\nabla_{w_r} \ell^{rob}_{sum}(x^m, y^m; w_1, \ldots, w_C, \beta) =
\begin{cases}
\sum_{y' \neq y^m} \nabla_w \ell^{rob}_{hinge}(x^m, +1; w, \beta) \big|_{w = w_{y^m} - w_{y'}} & \text{if } r = y^m \\
- \nabla_w \ell^{rob}_{hinge}(x^m, +1; w, \beta) \big|_{w = w_{y^m} - w_r} & \text{otherwise}
\end{cases}
\]

Following the considerations that we have introduced in Chapter 4, we devise an SGD
procedure for the minimization task:

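Algorithm 3: M-GURU($S$, $\eta$, $\epsilon$)
Data: Training set $S$, learning rate $\eta$, accuracy $\epsilon$
Result: $w$

    $w \leftarrow 0$;
    while $\Delta L > \epsilon$ do
        $m \leftarrow$ rand(M);
        for $y \in \{1, 2, \ldots, C\}$ do
            $w_y \leftarrow w_y - \eta \nabla_{w_y} \ell^{rob}_{sum}(x^m, y^m; w_1, w_2, \ldots, w_C, \beta)$;
        end
    end
    return $w$;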



In Algorithm 3, the notion of stochastic gradient was applied once, in the sense that
our updates depend on a single sample in each iteration. It may be applied again, however.
Instead of updating all the weight vectors concurrently, one might randomize which vector
to update as well. The resulting algorithm is



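Algorithm 4: M-GURU-$S^2$($S$, $\eta$, $\epsilon$)
Data: Training set $S$, learning rate $\eta$, accuracy $\epsilon$
Result: $w$

    $w \leftarrow 0$;
    while $\Delta L > \epsilon$ do
        $m \leftarrow$ rand(M);
        $y' \leftarrow$ rand(C);
        $w_{y'} \leftarrow w_{y'} - \eta \nabla_{w_{y'}} \ell^{rob}_{sum}(x^m, y^m; w_1, w_2, \ldots, w_C, \beta)$;
    end
    return $w$;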
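A primal multiclass procedure in the spirit of Algorithm 3 could be sketched in Python as follows. Several parts are assumptions made for illustration: the binary robust loss is evaluated through the closed form of the expected hinge under isotropic Gaussian noise with variance $\beta$, the step size decays as $\eta/\sqrt{t}$, a fixed iteration budget replaces the $\Delta L > \epsilon$ test, labels are 0-indexed, and all function names are hypothetical.

```python
import math
import random

def _dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def _pdf_cdf(u):
    """Standard normal density and distribution function at u."""
    return (math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi),
            0.5 * (1.0 + math.erf(u / math.sqrt(2.0))))

def robust_hinge(w, x, beta):
    """E[1 - w^T (x + n)]_+ for n ~ N(0, beta * I) and label +1 (closed form)."""
    z, s = 1.0 - _dot(w, x), math.sqrt(beta * _dot(w, w))
    if s < 1e-12:
        return max(0.0, z)
    pdf, cdf = _pdf_cdf(z / s)
    return z * cdf + s * pdf

def robust_hinge_grad(w, x, beta):
    """Gradient of robust_hinge with respect to w."""
    z, nw = 1.0 - _dot(w, x), math.sqrt(_dot(w, w))
    if math.sqrt(beta) * nw < 1e-12:
        c = 1.0 if z > 0.0 else 0.0           # plain hinge subgradient at w = 0
        return [-c * xj for xj in x]
    pdf, cdf = _pdf_cdf(z / (math.sqrt(beta) * nw))
    return [-cdf * xj + math.sqrt(beta) * pdf * wj / nw for xj, wj in zip(x, w)]

def numeric_grad(w, x, beta, h=1e-6):
    """Central finite differences, used only to sanity-check robust_hinge_grad."""
    out = []
    for j in range(len(w)):
        up, down = list(w), list(w)
        up[j] += h
        down[j] -= h
        out.append((robust_hinge(up, x, beta) - robust_hinge(down, x, beta)) / (2.0 * h))
    return out

def m_guru(X, y, n_classes, beta, eta, n_iter=2000, seed=0):
    """SGD on the robust sum-of-hinges loss; all weight vectors move in each iteration."""
    rng = random.Random(seed)
    d = len(X[0])
    W = [[0.0] * d for _ in range(n_classes)]
    for t in range(1, n_iter + 1):
        m = rng.randrange(len(X))
        step = eta / math.sqrt(t)                       # assumed step-size schedule
        grads = [[0.0] * d for _ in range(n_classes)]
        for c in range(n_classes):
            if c == y[m]:
                continue
            diff = [a - b for a, b in zip(W[y[m]], W[c])]
            g = robust_hinge_grad(diff, X[m], beta)
            for j in range(d):
                grads[y[m]][j] += g[j]                  # r = y^m: sum over the wrong classes
                grads[c][j] -= g[j]                     # r != y^m: minus the single term
        for c in range(n_classes):
            for j in range(d):
                W[c][j] -= step * grads[c][j]
    return W

def predict(W, x):
    """Arg-max classifier of Equation 5.1 (0-indexed labels)."""
    return max(range(len(W)), key=lambda c: _dot(W[c], x))
```

The $S^2$ variant of Algorithm 4 would differ only in the inner update: a single class index is drawn at random and only that weight vector is moved.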
Name          #Training   #Cross-validation   #Test     #features   #classes
              samples     samples             samples
Toy-3         200         200                 200       2           3
Toy-4         200         200                 200       2           4
USPS 3,5,8    1200        1050                1050      256         3
USPS 0-9      3000        2000                6000      256         10
splice        1000        1000                1190      60          3
wine          50          50                  78        13          3


Table 5.1: Description of the databases used in the multiclass case



4. Experiments

M-GURU and M-GURU-$S^2$ were tested on toy problems, USPS and a couple of UCI
databases (Frank and Asuncion [2010]). The datasets are detailed in Table 5.1. In Toy-3 and
Toy-4 each class is a Gaussian distribution. These problems are visualized in Figure 5.3.
The results are summarized in Table 5.2.

Observe that the performance of M-GURU is similar to that of SVM. Nonetheless, it
should be noted that SVM slightly outperforms M-GURU. This difference is explained
by the fact that M-GURU is based on the sum-of-hinges loss function, which is a looser
surrogate of the zero-one loss than the SVM multi-hinge loss function. We have tested the
relative performance of M-GURU and M-GURU-$S^2$ on the Toy-3 dataset.



Figure 5.2: A typical run of M-GURU and M-GURU-$S^2$ on the Toy-3 dataset. The loss is plotted
against the number of updates that were performed. The $S^2$ variant appears to have an advantage
in the descent phase. In the convergence phase, however, M-GURU takes the lead. Overall, the
performance of both variants is very similar. (a) linear scale, (b) semi-logarithmic scale.
Name          M-GURU(%)   M-GURU-$S^2$(%)   SVM(%)
Toy-3         98.67       98                98.67
Toy-4         96          96                96
USPS 3,5,8    94.67       94.57             94.857
USPS 0-9      92.78       92.7              92.85
splice        89.08       89.08             89.5
wine          92.31       91.03             92.31

Table 5.2: Summary of the results.




Figure 5.3: The toy problems used in the testing of M-GURU and M-GURU-$S^2$. (a) Toy-3. (b)
Toy-4.

We observe that M-GURU outperforms the $S^2$ variant. Our experiments show that the
empirical behavior of the classifiers stabilizes well before the optimization
process converges. Thus, M-GURU-$S^2$ may be used to learn classifiers more quickly.
Chapter 6
Discussion 



1. Contribution 

In this work we presented a new robust learning framework. In our framework we minimize 
the expected loss over a spreading of the sample points. Each displacement is assumed to 
take place with a probability that depends on its distance from the original point. Thus, we 
effectively replace each point with a fading cloud. 

We have analyzed the case of Gaussian noise distribution, where the underlying loss
measure is the hinge loss. In this case, we have shown that the resulting effective loss
function is a smooth, strictly convex upper approximation of the hinge loss, denoted
$\ell^{rob}_{hinge}$. One of the main advantages of this loss function is that its parameter $\sigma$ has a
clear meaning: the variance of the noise that contaminates the data. Similarly to SVM, our
algorithm, named GURU, depends on a single parameter. A significant difference is the ability
to assign a value to this parameter. In the case of SVM, for a long time all that was known
about this parameter was that it controls the tradeoff between the training error and the margin
of the classifier. Xu et al. [2009] have shown that SVM is equivalent to a robust formulation
in which the parameter corresponds to the radius of a rigid ball in which the sample
point may be displaced. This result, however, relates the parameter to the entire data
set. Thus, it is still difficult to tune it. In our method, $\sigma$ is the magnitude of noise that
possibly corrupts each sample point, hence it might be evaluated from physical considerations,
such as the process that generates the data.

Without extra effort, we are able to point out an alternative explanation for the poor
generalization of non-regularized SVM. We have shown that as $\sigma$ tends to 0, $\ell^{rob}_{hinge}$
coincides asymptotically with the hinge loss. Thus, non-regularized SVM may be understood as
not trying to achieve robustness to perturbations, hence it tends to overfit the data.

We have shown that $\ell^{rob}_{hinge}$ may be written as a perspective of a smooth loss function
(denoted $f$), where the scaling factor is $\sigma \|w\|$. This representation suggests that the robust
framework we have developed introduces a multiplicative regularization. Using this
representation we have derived a dual problem. The dual formulation depends on the actual loss
function $f$ only via its conjugate dual. Thus, it is possible to plug into the same formulation
other losses that satisfy certain conditions. In particular, as we have demonstrated in
Chapter 3, there is a tight connection between approximations of the loss function and
relaxations of the dual problem. We believe that applying the same technique we have applied
here to other loss functions will result in new robust learning algorithms. The connection
between the primal loss and the resulting dual should be investigated more thoroughly.

The algorithmic approach we have taken in this work is rather simplistic. Since our objective
is strictly convex, many off-the-shelf convex optimization algorithms may be used. Our method of
choice was stochastic gradient descent. Furthermore, if there is a bound on the norm of the
optimal classifier (as in SVM; see Shalev-Shwartz et al. [2007a] for details), it is probably
possible to use it in order to achieve even faster algorithms. Specifically, subject to such a
bound, we may restrict the optimization problem to a ball around the origin. In this ball, it
is possible that our loss function is strongly convex, hence it can be optimized using a more
aggressive procedure (Shalev-Shwartz and Kakade [2008]).

Our generalization to Mercer kernels is based on the primal formulation. In order to compute
the updates fast ($O(M)$), we have shown how to maintain the value of the norm of the classifier
in $O(1)$ based on pre-computed values. This technique may be employed in Pegasos, for example,
in order to perform the projection step efficiently.
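The relation between the smoothed loss and the hinge loss mentioned in this section can be illustrated numerically. The Python sketch below is an assumption-laden illustration: it uses the closed form of $E[z + s g]_+$ for a standard normal $g$, where $z = 1 - y w^T x$ and $s$ stands for $\sigma \|w\|$, and the function names are hypothetical.

```python
import math

def hinge(z):
    """Plain hinge as a function of z = 1 - y w^T x."""
    return max(0.0, z)

def gaussian_robust_hinge(z, s):
    """E[z + s*g]_+ for g ~ N(0, 1); here s stands for sigma * ||w||."""
    if s <= 0.0:
        return hinge(z)
    u = z / s
    pdf = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))
    return z * cdf + s * pdf
```

For a fixed $z$ the smoothed value never falls below the hinge, and the gap shrinks to zero as $s$ tends to 0, which is the asymptotic coincidence referred to above.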

2. Generalizations

The framework we have introduced may be generalized in a couple of interesting directions.
Obviously, various families of noise distributions may be plugged into the model. One
particularly interesting family is the class of all probability distributions having a specific
first and second moment. Vandenberghe et al. [2007] have shown that tight bounds on the
probability of a set defined by quadratic inequalities may be computed using semidefinite
programming. In addition, they have shown that the optimum is achieved over a discrete
probability distribution. We conjecture that a similar technique may be employed in our case,
in order to show that the optimum of the loss expectation is attained over a discrete
distribution. In addition, the same framework can be used in order to explore more complex
perturbations. For example, in the field of computer vision it is possible to assume that the
adversary rotates or translates the sample, and that the distribution of these perturbations is
chosen adversarially. In order to make this practical, it is crucial to understand in which cases
the integration of the loss is possible.

Regarding the theoretical aspects of this work, it still remains to show how to derive
performance bounds for the introduced framework. In particular, it is interesting to
understand what kind of guarantees can be derived for the general perspective-optimization
framework we have discussed.
Appendix A

Single-Point Algorithms



The objective of this work is to learn classifiers that are robust to noise. As discussed, a
possible way to achieve this goal is by applying an adversarial framework. The most important
issue in this case is designing an effective adversary. While in the previous chapters of
the work we explored more sophisticated adversaries, it is nice to end the journey with a
rather simple mathematical formulation. The binary version of the algorithm has been
extensively studied. We review the result here for the sake of a complete presentation. A simple
generalization to the multiclass case is presented subsequently.

1. Problem presentation

Perhaps the simplest action that the adversary can take at test time is displacing a test point
in such a way that the point is misclassified. If we limit the freedom given to
the adversary, it might not be able to corrupt the classification of the point, but rather only
reduce the associated confidence. The model that we explore in the following grants
the adversary the ability to displace a sample point within a ball centered at the original
point.

In order for the learned classifier to be robust to such displacements, we should modify the
objective of the learning task. In the following we present and analyze one way to do it, by
optimizing the worst-case scenario:

\[
\min_{w} \; \max_{\|\Delta x^m\| \le \delta,\; m = 1..M} \;
\frac{\lambda}{2} \|w\|^2 + \sum_{m=1}^{M} \big[ 1 - y^m w^T (x^m + \Delta x^m) \big]_+ \tag{A.1}
\]

This formulation has an additive structure, in which each term $\Delta x^m$ appears exactly once.
We use these properties in order to decouple the optimization problem. The learning task
at hand in this case is thus

\[
\min_{w} \; \frac{\lambda}{2} \|w\|^2 + \sum_{m=1}^{M} \max_{\|\Delta x^m\| \le \delta}
\big[ 1 - y^m w^T (x^m + \Delta x^m) \big]_+ \tag{A.2}
\]

Recall that in the general SVM setting, one tries to minimize the hinge loss:

\[
\ell_{hinge}(x, y; w) = [1 - y w^T x]_+ \tag{A.3}
\]

Equation A.2 can be interpreted as optimizing the effective loss function

\[
\ell^{single}_{hinge}(x, y; w) = \max_{\|\Delta x\| \le \delta} \big[ 1 - y w^T (x + \Delta x) \big]_+ \tag{A.4}
\]

We say that this loss function is robust, in the sense that it represents the worst-case loss
subject to the potential action of the adversary.

2. Computing the optimal displacement

In order to derive a closed form for the loss function $\ell^{single}_{hinge}$, we should explore the
nature of the adversarial choice in our model. Intuitively, the adversary will try to relocate the
point to the wrong side of the separating hyperplane. To this end, it is pointless to move the
point along any direction that is not orthogonal to the separating hyperplane. This idea is
visualized in Figure A.1. We will now prove this simple theorem:

Theorem 2.1: The optimum of the maximization in Equation A.4 is achieved at
$x_{opt} = x - \delta y \frac{w}{\|w\|}$.

Proof: First we observe that the function $f(z) = [1 - z]_+$ is a monotone non-increasing
function of its argument $z$. Thus, maximizing $f(z)$ is equivalent to minimizing $z$, where here
$z = y w^T (x + \Delta x)$. By the Cauchy-Schwarz inequality, we have that
$|y w^T \Delta x| \le \|w\| \cdot \|\Delta x\|$, with equality iff $\Delta x$ is proportional to $w$.
Therefore, the minimal value possible is attained at $\Delta x_{opt} = -\delta y \frac{w}{\|w\|}$.
We conclude that $x_{opt} = x - \delta y \frac{w}{\|w\|}$ as claimed. ∎



Plugging the result of the theorem above into Equation A.4 we end up with

\[
\ell^{single}_{hinge}(x, y; w) = \big[ 1 - y w^T x + \delta \|w\| \big]_+ \tag{A.5}
\]
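The closed form can be verified against a direct evaluation of the hinge loss at the displaced point. The following Python sketch is a numerical check under illustrative assumptions: the function names are hypothetical, and the random displacements are merely some points inside the ball of radius $\delta$, not a uniform sample of it.

```python
import math
import random

def _dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def hinge(w, x, y):
    """[1 - y w^T x]_+"""
    return max(0.0, 1.0 - y * _dot(w, x))

def worst_case_displacement(w, y, delta):
    """The displacement -delta * y * w / ||w|| that maximizes the hinge loss."""
    nw = math.sqrt(_dot(w, w))
    return [-delta * y * wj / nw for wj in w]

def single_point_robust_hinge(w, x, y, delta):
    """Closed form [1 - y w^T x + delta ||w||]_+ of the worst-case loss."""
    return max(0.0, 1.0 - y * _dot(w, x) + delta * math.sqrt(_dot(w, w)))

def random_displacement(dim, delta, rng):
    """A random vector of norm at most delta (enough for a sanity check)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    scale = delta * rng.random() / math.sqrt(_dot(v, v))
    return [scale * vj for vj in v]
```

Evaluating `hinge` at `x` shifted by `worst_case_displacement` reproduces `single_point_robust_hinge`, and no other displacement within the ball yields a larger loss.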



3. ASVC: Adversarial Support Vector Classification

The fact that Equation A.4 has a simple closed-form solution allows us to employ the
algorithmic scheme of alternating optimization for Equation A.1. The structure of the
algorithm is quite simple:

1. Alternately:

(a) Optimize for $w$

(b) Optimize for $\Delta x^1, \Delta x^2, \ldots, \Delta x^M$

Until convergence.

Figure A.1: The adversarial displacement employed by ASVC

Notice that (a) is nothing more than an SVM taking the displaced points as input.
Furthermore, (b) has a closed-form solution, as we have proved in Theorem 2.1. Thus, to solve
for the optimal classifier, any off-the-shelf SVM solver can be used. We end up with
Algorithm 5.

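Algorithm 5: ASVC($S$, $\delta$, $\lambda$, $T$, $k$)
Data: Training set $S$, radius $\delta$, tradeoff $\lambda$
Result: The weight vector $w$

    $w \leftarrow 0$;
    repeat
        $\Delta x^m \leftarrow -\delta y^m \frac{w}{\|w\|}$;
        $\tilde{S} \leftarrow \{x^m + \Delta x^m\}_{x^m \in S}$;
        $w \leftarrow$ solveSVM($\tilde{S}$, $\lambda$)
    until convergence;
    return $w$;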



4. The Multiclass Case

Similar ideas can be adopted in order to generalize ASVC to the multiclass case.
The multi-hinge loss is defined as

\[
\ell_{mult}(x^m, y^m; w_1, w_2, \ldots, w_C) = \max_{y = 1, 2, \ldots, C} \big[ \delta_{y, y^m} - (w_{y^m} - w_y)^T x^m \big] \tag{A.6}
\]

Using the notions of the previous section, we define

\[
\ell^{single}_{mult}(x^m, y^m; w_1, w_2, \ldots, w_C) = \max_{\|\Delta x\| \le \delta} \; \max_{y = 1, 2, \ldots, C}
\big[ \delta_{y, y^m} - (w_{y^m} - w_y)^T (x^m + \Delta x) \big] \tag{A.7}
\]

Note that the order of maximization can be changed, i.e.

\[
\ell^{single}_{mult}(x^m, y^m; w_1, w_2, \ldots, w_C) = \max_{y = 1, 2, \ldots, C} \; \max_{\|\Delta x\| \le \delta}
\big[ \delta_{y, y^m} - (w_{y^m} - w_y)^T (x^m + \Delta x) \big] \tag{A.8}
\]

Applying a slight variation of Theorem 2.1 we conclude with

\[
\ell^{single}_{mult}(x^m, y^m; w_1, w_2, \ldots, w_C)
= \max_{y = 1, 2, \ldots, C} \Big[ \delta_{y, y^m} - (w_{y^m} - w_y)^T \Big( x^m - \delta \frac{w_{y^m} - w_y}{\|w_{y^m} - w_y\|} \Big) \Big]
= \max_{y = 1, 2, \ldots, C} \big[ \delta_{y, y^m} - (w_{y^m} - w_y)^T x^m + \delta \|w_{y^m} - w_y\| \big]
\]

5. Related work

Our ASVC algorithm is a mirror reflection of TSVC, presented in (Bi and Zhang, NIPS 2004).
TSVC performs alternating optimization, each time replacing the set of training samples
with $\{x^i + y^i \delta \frac{w}{\|w\|}\}$, which are more distant from the separator (thus, easier to
classify). The idea there is to address the case in which noisy data distracts the classifier, by
using the shifted training sets.

Figure A.2: The displacement employed by TSVC
Appendix B
Diagonal Covariance 



In this appendix we discuss the case in which the adversary is constrained to choose a
diagonal covariance matrix. This setting corresponds to the case when the noise is aligned
with the primary axes. In this case we are able to give a closed-form analytical result, subject
to a bounded trace constraint on the covariance matrix.
The adversarial choice problem can be written as

\[
\max_{\Sigma = diag(a_1, a_2, \ldots, a_d),\; tr(\Sigma) \le \beta} L(x^m, y^m; w, w^T \Sigma w) \tag{B.1}
\]

Let us expand

\[
w^T \Sigma w = w^T diag(a_1, a_2, \ldots, a_d)\, w = \sum_{i=1}^{d} a_i w_i^2 = a^T w^{\circ 2}
\]

where $w^{\circ 2}$ represents the coordinate-wise product of $w$ with itself. Let $i^*$ be the index of
the maximal entry of $w^{\circ 2}$. It holds that

\[
w^T \Sigma w \le \beta\, w_{i^*}^2 \tag{B.2}
\]

Using the same argumentation as in Chapter 2, we conclude that the adversary will choose
the covariance matrix

\[
\Sigma^* = \beta\, e_{i^* i^*} \tag{B.3}
\]

where $e_{i^* i^*}$ is the matrix having zeros in all of its entries except $(i^*, i^*)$, where it takes the
value 1. The geometric meaning of this result is that the adversary will choose to spread
the noise in a single direction, along the primary axis that creates the biggest angle with the
separating hyperplane.
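The adversarial choice under the diagonal restriction amounts to picking the coordinate with the largest $w_i^2$. A minimal Python sketch follows; the function names are hypothetical, and the random sampler only produces some feasible diagonals for comparison.

```python
import random

def diagonal_adversary(w, beta):
    """Return (a, value): the diagonal of Sigma* and the resulting w^T Sigma* w."""
    i_star = max(range(len(w)), key=lambda i: w[i] ** 2)
    a = [0.0] * len(w)
    a[i_star] = beta              # all the noise power on a single primary axis
    return a, beta * w[i_star] ** 2

def random_feasible_diagonal(dim, beta, rng):
    """Nonnegative diagonal entries whose sum is at most beta."""
    raw = [rng.random() for _ in range(dim)]
    scale = beta * rng.random() / sum(raw)
    return [scale * r for r in raw]
```

For example, with $w = (0.5, -2, 1)$ and $\beta = 3$ the adversary places all the variance on the second axis, giving $w^T \Sigma^* w = 3 \cdot 4 = 12$.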



Figure B.1: Under the diagonal covariance restriction, the adversary will choose to spread the noise
in a unique direction. This direction is the one that creates the biggest angle with the separating
hyperplane.
Appendix C

Using the Multi-Hinge Loss 



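The most common generalization of the hinge loss for the multiclass case is the following
loss function

\[
\ell_{mult}(x^m, y^m; w_1, \ldots, w_C) = \max_{y} \big[ w_y^T x^m - w_{y^m}^T x^m + \delta_{y, y^m} \big] \tag{C.1}
\]

(see Crammer and Singer [2002]). In this appendix we point out some of the issues that
made us choose to work with the sum-of-hinges loss function and not with the one above.
If we plug the multi-hinge loss into our framework, we get the following learning problem:

\[
\min_{w} \sum_{m} \max_{\Sigma \in \mathcal{S}} E_{n} \max_{y}
\big[ w_y^T (x^m + n) - w_{y^m}^T (x^m + n) + \delta_{y, y^m} \big] \tag{C.2}
\]

Define $\Delta w_{y, y^m} = w_y - w_{y^m}$ and write:

\[
\min_{w} \sum_{m} \max_{\Sigma \in \mathcal{S}} E_{n} \max_{y}
\big[ \Delta w_{y, y^m}^T x^m + \Delta w_{y, y^m}^T n + \delta_{y, y^m} \big] \tag{C.3}
\]

And for Gaussian noise this is:

\[
\min_{w} \sum_{m} \max_{\Sigma \in \mathcal{S}} c\, |\Sigma|^{-0.5} \int e^{-\frac{1}{2} n^T \Sigma^{-1} n}
\max_{y} \big[ \Delta w_{y, y^m}^T x^m + \Delta w_{y, y^m}^T n + \delta_{y, y^m} \big] \, dn \tag{C.4}
\]

The ability to understand the solution of the adversarial choice problem in this case is
connected to the ability to understand the expectation of the maximum of a set of normal
random variables. This problem probably does not have an analytical solution (see Ross
[2003]).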
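Although a closed form seems out of reach, the expectation itself is easy to estimate by sampling. The following Python sketch is an illustration under assumptions: the noise is isotropic with standard deviation `sigma`, labels are 0-indexed, the multi-hinge loss uses the common convention of a unit margin against every wrong class, and the function names are hypothetical.

```python
import random

def multi_hinge(W, x, label):
    """max over y of [w_y^T x - w_label^T x + 1{y != label}] (assumed margin convention)."""
    s = [sum(wj * xj for wj, xj in zip(w, x)) for w in W]
    return max(s[c] - s[label] + (0.0 if c == label else 1.0) for c in range(len(W)))

def expected_multi_hinge(W, x, label, sigma, n_samples=20000, seed=0):
    """Monte Carlo estimate of E_n[multi_hinge(x + n)] for n ~ N(0, sigma^2 I)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        noisy = [xj + rng.gauss(0.0, sigma) for xj in x]
        total += multi_hinge(W, noisy, label)
    return total / n_samples
```

Such an estimate can be used to evaluate a given noise level, but it offers no handle on the inner maximization over covariance matrices, which is what the robust framework requires.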



UNIDIRECTIONAL NOISE

In another approach we have studied, we assumed an adversary that spreads the noise in a
single direction. The motivation for this kind of adversary is the solution to the adversarial
choice problem in the binary case.

We formulate the problem by letting the adversary choose a unit-length vector $v$. Thus,
in the case of unidirectional noise, the task that the adversary faces is:

\[
\max_{v : \|v\| = 1} \int \mathcal{N}(z; 0, \sigma^2) \max_{y}
\big[ \Delta w_{y, y^m}^T x^m + z\, \Delta w_{y, y^m}^T v + \delta_{y, y^m} \big] \, dz \tag{C.5}
\]

The integrand (excluding the pdf) is a piecewise linear function. The knees of this
function as well as the slopes of the linear sections are strongly dependent on $v$. Nonetheless,
it is impossible to find a closed-form solution for the position of the knees. Therefore, we
find this direction inapplicable in our case as well.